Densely-packed analyte layers and detection methods

ABSTRACT

Disclosed herein are methods and systems for detection and discrimination of optical signals from a densely packed substrate. These have broad applications for biomolecule detection near or below the diffraction limit of optical systems, including in improving the efficiency and accuracy of polynucleotide sequencing applications.

CROSS-REFERENCE

This application is a continuation of PCT/US2019/051796, filed Sep. 18, 2019, which claims priority to U.S. Provisional Patent Application No. 62/733,525, filed Sep. 19, 2018, both of which are incorporated herein by reference in their entireties.

BACKGROUND

Affordable, rapid sequencing is causing a revolution in medicine and healthcare globally. As the price of a genome has dropped dramatically since the first human genome was sequenced in 2000, a significant milestone, the $1,000 genome, was recently achieved. However, there is huge demand for lower cost sequencing that can enable applications such as large population sequencing, disease screening and early detection.

A standard for measuring the cost of sequencing is the price of a 30× human genome, defined as 90 gigabases. The major cost components for sequencing systems are primarily the consumables, which include the biochip and reagents, and secondarily the instrument costs.

SUMMARY

Recognized herein is the need for improved imaging methods and improved substrates and methods of producing substrates with dense layers of analytes. For example, to reach a $10 30× genome, a 100-fold cost reduction, it may be desirable to increase the amount of data per chip unit area by 100-fold and/or decrease the amount of reagent per data point by 100-fold. Thus, it may be desirable to provide chips with high density deoxyribonucleic acid (DNA) packing that can be imaged by a super-resolution imaging system.

In an example $1,000 genome platform with cluster densities of ten million molecules per square centimeter, each molecule occupies on average 10 µm² of chip area. Thus, the average effective pitch is 3,160 nm. If densities increase 100-fold, then for the same chip area and reagent, 100-fold more information may be obtained, resulting in a 100-fold reduction in costs. At 100-fold higher density, the new pitch may need to be 320 nm.
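
The pitch figures above follow from treating each molecule as occupying one square cell of the chip surface. The following is a minimal sketch of that arithmetic; the function name `effective_pitch_nm` and the square-cell assumption are illustrative and not part of the disclosure.

```python
import math


def effective_pitch_nm(molecules_per_cm2: float) -> float:
    """Average effective pitch in nm, assuming one molecule per square cell."""
    area_per_molecule_um2 = 1e8 / molecules_per_cm2  # 1 cm^2 = 1e8 um^2
    return math.sqrt(area_per_molecule_um2) * 1000.0  # um -> nm


baseline_pitch = effective_pitch_nm(1e7)  # ten million molecules per cm^2
dense_pitch = effective_pitch_nm(1e9)     # 100-fold higher density

print(round(baseline_pitch), round(dense_pitch))  # prints: 3162 316
```

Because pitch scales with the inverse square root of density, a 100-fold increase in density corresponds to a 10-fold decrease in pitch, which the text rounds to 3,160 nm and 320 nm.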

Thus, to reduce sequencing cost, it is essential to achieve high density packing of target DNA molecules distributed on a surface of a low-cost biochip. However, the fluorescent signals generated from these densely packed molecules during sequencing need to be resolved by a super-resolution optical imaging system that can resolve optical signals below the diffraction limit of light.

Furthermore, although other methods exist that are not constrained by the diffraction limit of optical signals, such as electrical based systems developed by companies such as Ion Torrent and Oxford Nanopore, the lowest sequencing costs of all existing technologies may be achieved by optical based systems through the combination of high throughput imaging and low cost consumables.

Affordable, rapid sequencing may be causing a revolution in medicine and healthcare globally. As the price of a genome has dropped dramatically since the first human genome was sequenced in 2000, a significant milestone, the $1,000 genome, was recently achieved. However, recognized herein is a huge demand for lower cost sequencing that can enable applications such as large population sequencing, disease screening and early detection. The present disclosure provides methods and systems to achieve a $10 genome in a substantially contracted time frame. At this price point, it may be economical to sequence every newborn and may remove the cost barrier for deep sequencing and single cell analysis.

A standard for measuring the cost of sequencing is the price of a 30× human genome, defined as 90 gigabases. The major cost components for sequencing systems may be primarily associated with the consumables, which may include a biochip and reagents, and secondarily the instrument costs. Thus, to reach a $10 30× genome, a 100-fold cost reduction, it may be desirable to increase the amount of data per chip unit area by 100-fold and/or decrease the amount of reagent per data point by 100-fold. Thus, it may be desirable to provide chips with high density DNA packing that can be imaged by a super-resolution imaging system.

In an example $1,000 genome platform with cluster densities of ten million molecules per square centimeter, each molecule occupies on average 10 square micrometers (µm²) of chip area. Thus, the average effective pitch is 3,160 nanometers (nm). If densities increase 100-fold, then for the same chip area and reagent, 100-fold more information may be obtained, resulting in a 100-fold reduction in costs. At 100-fold higher density, the new pitch may need to be 320 nm.

Thus, to reduce sequencing cost, it may be essential to achieve high density packing of target DNA molecules distributed on a surface of a low-cost biochip. However, the fluorescent signals generated from these densely packed molecules during sequencing may need to be resolved by a super-resolution optical imaging system that can resolve optical signals below the diffraction limit of light.

Furthermore, although other methods exist that are not constrained by the diffraction limit of optical signals, such as electrical based systems developed by companies such as Ion Torrent and Oxford Nanopore, the lowest sequencing costs of all existing technologies are achieved by optical based systems through the combination of high throughput imaging and low cost consumables.

The present disclosure provides methods and systems for analyte detection, such as nucleic acid sequencing (e.g., to achieve a $10 genome in a substantially contracted time frame). Methods and systems of the present disclosure may be used to identify a nucleic acid molecule, such as DNA or a ribonucleic acid (RNA) molecule, a polypeptide, and/or a protein. At a price point that may be achieved using methods and systems of the present disclosure, it may be economical to sequence every newborn and may remove the cost barrier for deep sequencing and single cell analysis.

An aspect of the present disclosure comprises a method for sequencing a plurality of analytes disposed at high density on a surface of a substrate, comprising: providing a substrate comprising a surface, wherein the surface comprises a plurality of analytes disposed on the surface at a density such that a minimum effective pitch between binding locations of analytes of said plurality of analytes is less than λ/(2*NA), wherein ‘NA’ is a numerical aperture of said optical imaging module, and wherein said surface comprises reagents for sequencing by synthesis; performing a plurality of cycles of probe binding to said plurality of analytes, a cycle of said plurality of cycles comprising: (i) contacting said plurality of analytes with a plurality of probes, a probe of said plurality of probes comprising a detectable label; and (ii) imaging a field of said surface with an optical system to detect an optical signal from each probe brought in contact with said plurality of analytes, thereby detecting a plurality of optical signals in said field for said cycle; determining a peak location from each of said plurality of optical signals from images of said field from at least two of said plurality of cycles; overlaying said peak locations for each optical signal and applying an optical distribution model at each cluster of optical signals to determine a relative position of each detected probe on said surface; resolving said optical signals in each field image from each cycle using said determined relative position and a resolving function; identifying said detectable labels for each field and each cycle from said deconvolved optical signals; and identifying analytes disposed on the surface of the substrate from said identified detectable labels across said plurality of cycles at each analyte position. In some embodiments, concatemers are loaded on the surface and closely packed to enable a center-to-center distance of about 250 nanometers (nm) with a variance of +/−25 nm. 
In some embodiments, the average center-to-center distance between molecules is about 315 nm. In some embodiments, the plurality of analytes (e.g., nucleic acid molecules) may be deposited adjacent to a surface such that adjacent analytes of the plurality of analytes may have average center-to-center spacings of at least 10 nanometers (nm), 50 nm, 100 nm, 110 nm, 120 nm, 130 nm, 140 nm, 150 nm, 160 nm, 170 nm, 180 nm, 190 nm, 200 nm, 210 nm, 220 nm, 230 nm, 240 nm, 250 nm, 260 nm, 270 nm, 280 nm, 290 nm, 300 nm, 310 nm, 320 nm, 330 nm, 340 nm, 350 nm, 360 nm, 370 nm, 380 nm, 390 nm, 400 nm, 410 nm, 420 nm, 430 nm, 440 nm, 450 nm, 460 nm, 470 nm, 480 nm, 490 nm, 500 nm, or more. The average center-to-center spacings may be less than or equal to 500 nm, 490 nm, 480 nm, 470 nm, 460 nm, 450 nm, 440 nm, 430 nm, 420 nm, 410 nm, 400 nm, 390 nm, 380 nm, 370 nm, 360 nm, 350 nm, 340 nm, 330 nm, 320 nm, 310 nm, 300 nm, 290 nm, 280 nm, 270 nm, 260 nm, 250 nm, 240 nm, 230 nm, 220 nm, 210 nm, 200 nm, 190 nm, 180 nm, 170 nm, 160 nm, 150 nm, 140 nm, 130 nm, 120 nm, 110 nm, 100 nm, 50 nm, or less. The plurality of analytes may be nucleic acid molecules (DNA and/or RNA), proteins and/or polypeptides. The plurality of analytes may be disposed adjacent to a surface such that an individual analyte of the plurality of analytes may be resolved (e.g., optically resolved). The plurality of analytes may be disposed adjacent to the surface such that adjacent analytes of the plurality of analytes do not touch or contact each other. In some embodiments, said surface is unpatterned. In some embodiments, said surface is patterned. In some embodiments, one or more analytes of said plurality of analytes are treated with a repellant or attractive substance. In some embodiments, said repellant or attractive substance comprises zwitterionic features. In some embodiments, said repellant or attractive substance comprises PEG, a polysaccharide, ampholine ampholytes, sulphobetaine, and/or BSA. 
In some embodiments, said analytes are DNA concatemers. In some embodiments, said DNA concatemers are hybridized to ssDNA hairs. In some embodiments, said analytes are proteins or peptides. In some embodiments, said probes comprise a plurality of reversible terminator nucleotides. In some embodiments, said plurality of reversible terminator nucleotides comprises at least four distinct nucleotides each with a distinct detectable label. In some embodiments, said resolving comprises removing interfering optical signals from neighboring polynucleotides using a center-to-center distance between said neighboring polynucleotides from said determined relative positions. In some embodiments, said resolving function comprises machine learning. In some embodiments, said resolving function comprises nearest neighbor variable regression. In some embodiments, said polynucleotides are densely packed on said substrate such that there is overlap between optical signals emitted by said detectable labels from nucleotides incorporated into adjacent polynucleotides, and wherein said adjacent polynucleotides each comprise a distinct sequence. In some embodiments, the polynucleotides are deposited on said surface at an average density of more than 4 molecules per square micron. In some embodiments, said imaging of said surface is performed at a resolution of one pixel per 300 nm or higher along an axis of the image field. In some embodiments, an optical imaging module is configured to obtain said plurality of optical signals at a resolution of one pixel per 250 nanometers or higher. In some embodiments, an optical imaging module is configured to obtain said plurality of optical signals at a resolution of one pixel per 200 nanometers or higher. In some embodiments, an optical imaging module is configured to obtain said plurality of optical signals at a resolution of one pixel per 150 nanometers or higher. 
In some embodiments, an optical imaging module is configured to obtain said plurality of optical signals at a resolution of one pixel per 100 nanometers or higher. In some embodiments, the method further comprises generating an oversampled image with a higher pixel density from each of said field images from each cycle. In some embodiments, said overlaying said peak locations comprises aligning positions of said optical signal peaks detected in each field for a plurality of said cycles to generate a cluster of optical peak positions for each polynucleotide from said plurality of cycles. In some embodiments, said overlaying said peak locations comprises aligning positions of said optical signal peaks detected in each field for a subset of said cycles to generate a cluster of optical peak positions for each polynucleotide from said subset of cycles. In some embodiments, said optical distribution model comprises a point spread function. In some embodiments, said relative position of said analytes deposited on the surface of the substrate is determined within 10 nm RMS.
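
The density condition in this aspect compares the minimum effective pitch against λ/(2*NA). As an illustrative sketch only, the check can be written as below; the wavelength of 600 nm and numerical aperture of 0.8 are hypothetical example values chosen for the arithmetic, not parameters stated in the disclosure, and the function names are likewise assumptions.

```python
def diffraction_limit_nm(wavelength_nm: float, numerical_aperture: float) -> float:
    """Return lambda / (2 * NA) in nanometers."""
    return wavelength_nm / (2.0 * numerical_aperture)


def is_sub_diffraction(pitch_nm: float, wavelength_nm: float,
                       numerical_aperture: float) -> bool:
    """True when the pitch is below lambda / (2 * NA)."""
    return pitch_nm < diffraction_limit_nm(wavelength_nm, numerical_aperture)


# Hypothetical optics: 600 nm emission, NA = 0.8 (assumed example values).
limit = diffraction_limit_nm(600.0, 0.8)
print(limit)                                  # prints: 375.0
print(is_sub_diffraction(250.0, 600.0, 0.8))  # prints: True
print(is_sub_diffraction(3160.0, 600.0, 0.8)) # prints: False
```

Under these assumed optics, the 250 nm and 315 nm spacings mentioned above fall below the threshold, whereas the 3,160 nm pitch of the example $1,000 genome platform does not.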

Another aspect of the present disclosure comprises a method for accurately determining a relative position of analytes deposited on a surface of a densely packed substrate, comprising: providing a substrate comprising a surface, wherein the surface comprises a plurality of analytes deposited on the surface at discrete locations; performing a plurality of cycles of probe binding and signal detection on said surface, each cycle comprising: contacting said analytes with a plurality of probes from a probe set, wherein said probes comprise a detectable label, wherein each of said probes binds specifically to a target analyte; and imaging a field of said surface with an optical system to detect a plurality of optical signals from individual probes bound to said analytes at discrete locations on said surface; determining a peak location from each of said plurality of optical signals from images of said field from at least two of said plurality of cycles; and overlaying said peak locations for each optical signal and applying an optical distribution model at each cluster of optical signals to determine a relative position of each detected analyte on said surface with improved accuracy. In some embodiments, concatemers are loaded on the surface and closely packed to enable a center-to-center distance of about 250 nm with a variance of +/−25 nm. In some embodiments, the average center-to-center distance between molecules is about 315 nm. 
In some embodiments, the plurality of analytes (e.g., nucleic acid molecules) may be deposited adjacent to a surface such that adjacent analytes of the plurality of analytes may have average center-to-center spacings of at least 10 nanometers (nm), 50 nm, 100 nm, 110 nm, 120 nm, 130 nm, 140 nm, 150 nm, 160 nm, 170 nm, 180 nm, 190 nm, 200 nm, 210 nm, 220 nm, 230 nm, 240 nm, 250 nm, 260 nm, 270 nm, 280 nm, 290 nm, 300 nm, 310 nm, 320 nm, 330 nm, 340 nm, 350 nm, 360 nm, 370 nm, 380 nm, 390 nm, 400 nm, 410 nm, 420 nm, 430 nm, 440 nm, 450 nm, 460 nm, 470 nm, 480 nm, 490 nm, 500 nm, or more. The average center-to-center spacings may be less than or equal to 500 nm, 490 nm, 480 nm, 470 nm, 460 nm, 450 nm, 440 nm, 430 nm, 420 nm, 410 nm, 400 nm, 390 nm, 380 nm, 370 nm, 360 nm, 350 nm, 340 nm, 330 nm, 320 nm, 310 nm, 300 nm, 290 nm, 280 nm, 270 nm, 260 nm, 250 nm, 240 nm, 230 nm, 220 nm, 210 nm, 200 nm, 190 nm, 180 nm, 170 nm, 160 nm, 150 nm, 140 nm, 130 nm, 120 nm, 110 nm, 100 nm, 50 nm, or less. In some embodiments, said surface is unpatterned. In some embodiments, said surface is patterned. In some embodiments, the method further comprises: resolving said optical signals in each field image from each cycle using said determined relative position and a resolving function; and identifying said detectable labels bound to said deposited analytes for each field and each cycle from said deconvolved optical signals. In some embodiments, one or more analytes of said plurality of analytes are treated with a repellant or attractive substance. In some embodiments, said repellant or attractive substance comprises zwitterionic features. In some embodiments, said repellant or attractive substance comprises PEG, a polysaccharide, ampholine ampholytes, sulphobetaine, and/or BSA. In some embodiments, said analytes are DNA concatemers. In some embodiments, said DNA concatemers are hybridized to ssDNA hairs. In some embodiments, said analytes are proteins or peptides. 
In some embodiments, the method further comprises using said detectable label identity for each analyte detected at each cycle to identify a plurality of said analytes on said substrate. In some embodiments, said resolving comprises removing interfering optical signals from neighboring analytes using a center-to-center distance between said neighboring analytes from said determined relative positions of said neighboring analytes. In some embodiments, said resolving function comprises machine learning. In some embodiments, said resolving function comprises nearest neighbor variable regression. In some embodiments, said analytes are single biomolecules. In some embodiments, said analytes deposited on said surface are spaced apart on average less than the diffraction limit of the light emitted by the detectable labels and imaged by the optical system. In some embodiments, the deposited analytes comprise an average center-to-center distance between each analyte and the nearest adjacent analyte of less than 500 nm. In some embodiments, said overlaying said peak locations comprises aligning positions of said optical signal peaks detected in each field for a plurality of said cycles to generate a cluster of optical peak positions for each analyte from said plurality of cycles. In some embodiments, said relative position is determined with an accuracy of within 10 nm RMS. In some embodiments, said method resolves optical signals from a surface at a density of about 4 to about 25 analytes per square micron.
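
The overlaying step of this aspect can be pictured as pooling the peak locations found in each cycle and estimating one position per cluster. The sketch below is a simplified stand-in under stated assumptions: it presumes the per-cycle images are already registered to a common coordinate frame, groups peaks greedily by distance, and takes the plain centroid of each cluster in place of the optical distribution model (e.g., a point spread function) named in the disclosure. The function name `overlay_peaks`, the 100 nm grouping radius, and the sample coordinates are all hypothetical.

```python
import math


def overlay_peaks(peaks_by_cycle, radius_nm):
    """Pool peaks from all cycles and return (x, y, count) per cluster.

    Each peak joins the first cluster whose running centroid lies within
    radius_nm; otherwise it starts a new cluster.
    """
    clusters = []  # each entry is [sum_x, sum_y, count]
    for cycle_peaks in peaks_by_cycle:
        for x, y in cycle_peaks:
            for cluster in clusters:
                cx = cluster[0] / cluster[2]
                cy = cluster[1] / cluster[2]
                if math.hypot(x - cx, y - cy) <= radius_nm:
                    cluster[0] += x
                    cluster[1] += y
                    cluster[2] += 1
                    break
            else:
                clusters.append([x, y, 1])
    return [(sx / n, sy / n, n) for sx, sy, n in clusters]


# Hypothetical peak locations (nm) for two analytes imaged over four cycles.
cycles = [
    [(8.0, -6.0), (243.0, 5.0)],
    [(-10.0, 4.0), (258.0, -7.0)],
    [(5.0, 9.0), (251.0, 6.0)],
    [(-3.0, -7.0), (248.0, -4.0)],
]
positions = overlay_peaks(cycles, radius_nm=100.0)
print(positions)  # prints: [(0.0, 0.0, 4), (250.0, 0.0, 4)]
```

Individual peaks in this made-up data are off by roughly 10 to 15 nm, while the pooled estimates are tighter; averaging over more cycles generally reduces the scatter of the position estimate, which is the motivation for using at least two cycles.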

Another aspect of the present disclosure comprises a system for determining the identity of a plurality of analytes, comprising an optical imaging device configured to image a plurality of optical signals from a field of a substrate over a plurality of cycles of probe binding to analytes deposited on a surface of the substrate; and an image processing module, said module configured to: determine a peak location from each of said plurality of optical signals from images of said field from at least two of said plurality of cycles; determine a relative position of each detected analyte on said surface with improved accuracy by applying an optical distribution model to each cluster of optical signals from said plurality of cycles; and deconvolve said optical signals in each field image from each cycle using said determined relative position and a resolving function. In some embodiments, concatemers are loaded on the surface and closely packed to enable a center-to-center distance of about 250 nm with a variance of +/−25 nm. In some embodiments, the average center-to-center distance between molecules is about 315 nm. In some embodiments, the plurality of analytes (e.g., nucleic acid molecules) may be deposited adjacent to a surface such that adjacent analytes of the plurality of analytes may have average center-to-center spacings of at least 10 nanometers (nm), 50 nm, 100 nm, 110 nm, 120 nm, 130 nm, 140 nm, 150 nm, 160 nm, 170 nm, 180 nm, 190 nm, 200 nm, 210 nm, 220 nm, 230 nm, 240 nm, 250 nm, 260 nm, 270 nm, 280 nm, 290 nm, 300 nm, 310 nm, 320 nm, 330 nm, 340 nm, 350 nm, 360 nm, 370 nm, 380 nm, 390 nm, 400 nm, 410 nm, 420 nm, 430 nm, 440 nm, 450 nm, 460 nm, 470 nm, 480 nm, 490 nm, 500 nm, or more. 
The average center-to-center spacings may be less than or equal to 500 nm, 490 nm, 480 nm, 470 nm, 460 nm, 450 nm, 440 nm, 430 nm, 420 nm, 410 nm, 400 nm, 390 nm, 380 nm, 370 nm, 360 nm, 350 nm, 340 nm, 330 nm, 320 nm, 310 nm, 300 nm, 290 nm, 280 nm, 270 nm, 260 nm, 250 nm, 240 nm, 230 nm, 220 nm, 210 nm, 200 nm, 190 nm, 180 nm, 170 nm, 160 nm, 150 nm, 140 nm, 130 nm, 120 nm, 110 nm, 100 nm, 50 nm, or less. In some embodiments, said surface is unpatterned. In some embodiments, said surface is patterned. In some embodiments, said image processing module is further configured to determine an identity of said analytes deposited on said surface using said deconvolved optical signals. In some embodiments, said optical imaging device comprises a moveable stage defining a scannable area. In some embodiments, said optical imaging device comprises a sensor and optical magnification configured to sample a surface of a substrate at below the diffraction limit in said scannable area. In some embodiments, the system further comprises a substrate comprising analytes deposited on a surface of the substrate at a center-to-center spacing below the diffraction limit. In some embodiments, said resolving comprises removing interfering optical signals from neighboring analytes using a center-to-center distance between said neighboring analytes to determine said relative positions of said neighboring analytes.

Another aspect of the present disclosure comprises a method for processing or analyzing a plurality of analytes, comprising: disposing said plurality of analytes adjacent to a surface of a substrate at a density wherein a minimum effective pitch is less than a measure of λ/(2*NA); obtaining a plurality of optical signals from said substrate over one or more cycles of probes binding to analytes of said plurality of analytes disposed adjacent to said substrate, wherein at least a subset of said plurality of optical signals overlap, which plurality of optical signals comprise light having a wavelength (λ); applying an imaging algorithm to process said plurality of optical signals to identify a position of an analyte of said plurality of analytes or a relative position of said analyte with respect to another analyte of said plurality of analytes; and using said positions or relative positions to identify said analytes of said plurality of analytes. In some embodiments, concatemers are loaded on the surface and closely packed to enable a center-to-center distance of about 250 nm with a variance of +/−25 nm. In some embodiments, the average center-to-center distance between molecules is about 315 nm. In some embodiments, the plurality of analytes (e.g., nucleic acid molecules) may be deposited adjacent to a surface such that adjacent analytes of the plurality of analytes may have average center-to-center spacings of at least 10 nanometers (nm), 50 nm, 100 nm, 110 nm, 120 nm, 130 nm, 140 nm, 150 nm, 160 nm, 170 nm, 180 nm, 190 nm, 200 nm, 210 nm, 220 nm, 230 nm, 240 nm, 250 nm, 260 nm, 270 nm, 280 nm, 290 nm, 300 nm, 310 nm, 320 nm, 330 nm, 340 nm, 350 nm, 360 nm, 370 nm, 380 nm, 390 nm, 400 nm, 410 nm, 420 nm, 430 nm, 440 nm, 450 nm, 460 nm, 470 nm, 480 nm, 490 nm, 500 nm, or more. 
The average center-to-center spacings may be less than or equal to 500 nm, 490 nm, 480 nm, 470 nm, 460 nm, 450 nm, 440 nm, 430 nm, 420 nm, 410 nm, 400 nm, 390 nm, 380 nm, 370 nm, 360 nm, 350 nm, 340 nm, 330 nm, 320 nm, 310 nm, 300 nm, 290 nm, 280 nm, 270 nm, 260 nm, 250 nm, 240 nm, 230 nm, 220 nm, 210 nm, 200 nm, 190 nm, 180 nm, 170 nm, 160 nm, 150 nm, 140 nm, 130 nm, 120 nm, 110 nm, 100 nm, 50 nm, or less. In some embodiments, said surface is unpatterned. In some embodiments, said surface is patterned. In some embodiments, one or more analytes of said plurality of analytes are treated with a repellant or attractive substance. In some embodiments, said repellant or attractive substance comprises zwitterionic features. In some embodiments, said repellant or attractive substance comprises PEG, a polysaccharide, ampholine ampholytes, sulphobetaine, and/or BSA. In some embodiments, said analytes are DNA concatemers. In some embodiments, said DNA concatemers are hybridized to ssDNA hairs. In some embodiments, said analytes are proteins or peptides. In some embodiments, step (b) further comprises configuring an optical processing module to overlay said plurality of optical signals from said one or more cycles of probes binding to analytes and step (c) further comprises applying an optical distribution model to said overlay of said plurality of optical signals to determine a relative position of each detected analyte. In some embodiments, said imaging algorithm comprises a resolving function. In some embodiments, said resolving function comprises machine learning. In some embodiments, said resolving function comprises nearest neighbor variable regression. In some embodiments, said resolving function comprises removing interfering optical signals from neighboring analytes using a center-to-center distance between said neighboring analytes. In some embodiments, said plurality of analytes are disposed adjacent to said substrate at a density of about 1 to 25 molecules per square micron. 
In some embodiments, an optical imaging module is configured to obtain said plurality of optical signals at a resolution of one pixel per 300 nanometers or higher. In some embodiments, an optical imaging module is configured to obtain said plurality of optical signals at a resolution of one pixel per 250 nanometers or higher. In some embodiments, an optical imaging module is configured to obtain said plurality of optical signals at a resolution of one pixel per 200 nanometers or higher. In some embodiments, an optical imaging module is configured to obtain said plurality of optical signals at a resolution of one pixel per 150 nanometers or higher. In some embodiments, an optical imaging module is configured to obtain said plurality of optical signals at a resolution of one pixel per 100 nanometers or higher.
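
Removing interfering signals from neighboring analytes using their center-to-center distances can be illustrated with a simple linear unmixing. This sketch is not the machine learning or nearest neighbor variable regression named in the embodiments; it substitutes a plain linear model in which each analyte contributes to its neighbors' measured intensity through an assumed Gaussian point spread function. The function names, the Gaussian width of 150 nm, and the sample intensities are hypothetical.

```python
import math


def crosstalk_matrix(positions, sigma_nm):
    """Fraction of analyte j's signal seen at analyte i (Gaussian PSF assumed)."""
    return [
        [
            math.exp(-((xi - xj) ** 2 + (yi - yj) ** 2) / (2.0 * sigma_nm ** 2))
            for (xj, yj) in positions
        ]
        for (xi, yi) in positions
    ]


def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def resolve_signals(observed, positions, sigma_nm):
    """Estimate each analyte's own intensity from overlapping measurements."""
    return solve_linear(crosstalk_matrix(positions, sigma_nm), observed)


# Three analytes on a 250 nm pitch; the middle one is dark in this cycle.
positions_nm = [(0.0, 0.0), (250.0, 0.0), (500.0, 0.0)]
true_signal = [100.0, 0.0, 40.0]
mixing = crosstalk_matrix(positions_nm, sigma_nm=150.0)
observed = [sum(w * s for w, s in zip(row, true_signal)) for row in mixing]
resolved = resolve_signals(observed, positions_nm, sigma_nm=150.0)
```

In this made-up example the dark middle analyte shows a substantial measured intensity purely from its neighbors, and the unmixing returns it to approximately zero. The result depends on how accurately the relative positions are known, which is why the position determination precedes the resolving step.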

Another aspect of the present disclosure comprises a method of controlling a distribution of an average minimum center-to-center distance between analytes of a plurality of analytes deposited on a surface, said method comprising treating said one or more analytes with a repellant or attractive substance. In some embodiments, concatemers are loaded on the surface and closely packed to enable a center-to-center distance of about 250 nm with a variance of +/−25 nm. In some embodiments, the average center-to-center distance between molecules is about 315 nm. In some embodiments, the plurality of analytes (e.g., nucleic acid molecules) may be deposited adjacent to a surface such that adjacent analytes of the plurality of analytes may have average center-to-center spacings of at least 10 nanometers (nm), 50 nm, 100 nm, 110 nm, 120 nm, 130 nm, 140 nm, 150 nm, 160 nm, 170 nm, 180 nm, 190 nm, 200 nm, 210 nm, 220 nm, 230 nm, 240 nm, 250 nm, 260 nm, 270 nm, 280 nm, 290 nm, 300 nm, 310 nm, 320 nm, 330 nm, 340 nm, 350 nm, 360 nm, 370 nm, 380 nm, 390 nm, 400 nm, 410 nm, 420 nm, 430 nm, 440 nm, 450 nm, 460 nm, 470 nm, 480 nm, 490 nm, 500 nm, or more. The average center-to-center spacings may be less than or equal to 500 nm, 490 nm, 480 nm, 470 nm, 460 nm, 450 nm, 440 nm, 430 nm, 420 nm, 410 nm, 400 nm, 390 nm, 380 nm, 370 nm, 360 nm, 350 nm, 340 nm, 330 nm, 320 nm, 310 nm, 300 nm, 290 nm, 280 nm, 270 nm, 260 nm, 250 nm, 240 nm, 230 nm, 220 nm, 210 nm, 200 nm, 190 nm, 180 nm, 170 nm, 160 nm, 150 nm, 140 nm, 130 nm, 120 nm, 110 nm, 100 nm, 50 nm, or less. In some embodiments, said surface is unpatterned. In some embodiments, said surface is patterned. In some embodiments, said repellant or attractive substance comprises zwitterionic features. In some embodiments, said repellant or attractive substance comprises PEG, a polysaccharide, ampholine ampholytes, sulphobetaine, and/or BSA. In some embodiments, said analytes are DNA concatemers. In some embodiments, said DNA concatemers are hybridized to ssDNA hairs. 
In some embodiments, said analytes are proteins or peptides. In some embodiments, said average minimum center-to-center distance between one or more analytes of a plurality of analytes is less than about 500 nm. In some embodiments, said average minimum center-to-center distance between one or more analytes of a plurality of analytes is about 315 nm. In some embodiments, the plurality of analytes (e.g., nucleic acid molecules) may be deposited adjacent to a surface such that adjacent analytes of the plurality of analytes may have average center-to-center spacings of at least 10 nanometers (nm), 50 nm, 100 nm, 110 nm, 120 nm, 130 nm, 140 nm, 150 nm, 160 nm, 170 nm, 180 nm, 190 nm, 200 nm, 210 nm, 220 nm, 230 nm, 240 nm, 250 nm, 260 nm, 270 nm, 280 nm, 290 nm, 300 nm, 310 nm, 320 nm, 330 nm, 340 nm, 350 nm, 360 nm, 370 nm, 380 nm, 390 nm, 400 nm, 410 nm, 420 nm, 430 nm, 440 nm, 450 nm, 460 nm, 470 nm, 480 nm, 490 nm, 500 nm, or more. The average center-to-center spacings may be less than or equal to 500 nm, 490 nm, 480 nm, 470 nm, 460 nm, 450 nm, 440 nm, 430 nm, 420 nm, 410 nm, 400 nm, 390 nm, 380 nm, 370 nm, 360 nm, 350 nm, 340 nm, 330 nm, 320 nm, 310 nm, 300 nm, 290 nm, 280 nm, 270 nm, 260 nm, 250 nm, 240 nm, 230 nm, 220 nm, 210 nm, 200 nm, 190 nm, 180 nm, 170 nm, 160 nm, 150 nm, 140 nm, 130 nm, 120 nm, 110 nm, 100 nm, 50 nm, or less. In some embodiments, said average minimum center-to-center distance between one or more analytes of a plurality of analytes is about 250 nm. In some embodiments, said treating of said one or more analytes with a repellant or attractive substance comprises applying said repellant or attractive substance to said surface prior to depositing said plurality of analytes to said surface. In some embodiments, said surface is unpatterned. In some embodiments, said surface is patterned.

Another aspect of the present disclosure comprises a method of controlling a distribution of an average minimum center-to-center distance between one or more analytes of a plurality of analytes deposited on a surface, said method comprising: treating said one or more analytes with a repellant or attractive substance; and exposing said plurality of analytes to a gas-liquid interface such that said plurality of analytes forms a monolayer of analytes deposited across said surface. In some embodiments, concatemers are loaded on the surface and closely packed to enable a center-to-center distance of about 250 nm with a variance of +/−25 nm. In some embodiments, the average center-to-center distance between molecules is about 315 nm. In some embodiments, the plurality of analytes (e.g., nucleic acid molecules) may be deposited adjacent to a surface such that adjacent analytes of the plurality of analytes may have average center-to-center spacings of at least 10 nanometers (nm), 50 nm, 100 nm, 110 nm, 120 nm, 130 nm, 140 nm, 150 nm, 160 nm, 170 nm, 180 nm, 190 nm, 200 nm, 210 nm, 220 nm, 230 nm, 240 nm, 250 nm, 260 nm, 270 nm, 280 nm, 290 nm, 300 nm, 310 nm, 320 nm, 330 nm, 340 nm, 350 nm, 360 nm, 370 nm, 380 nm, 390 nm, 400 nm, 410 nm, 420 nm, 430 nm, 440 nm, 450 nm, 460 nm, 470 nm, 480 nm, 490 nm, 500 nm, or more. The average center-to-center spacings may be less than or equal to 500 nm, 490 nm, 480 nm, 470 nm, 460 nm, 450 nm, 440 nm, 430 nm, 420 nm, 410 nm, 400 nm, 390 nm, 380 nm, 370 nm, 360 nm, 350 nm, 340 nm, 330 nm, 320 nm, 310 nm, 300 nm, 290 nm, 280 nm, 270 nm, 260 nm, 250 nm, 240 nm, 230 nm, 220 nm, 210 nm, 200 nm, 190 nm, 180 nm, 170 nm, 160 nm, 150 nm, 140 nm, 130 nm, 120 nm, 110 nm, 100 nm, 50 nm, or less. In some embodiments, said surface is unpatterned. In some embodiments, said surface is patterned. In some embodiments, said gas-liquid interface is an air-water interface. In some embodiments, the depositing of (c) comprises pulling or dragging. 
In some embodiments, said average minimum center-to-center distance between one or more analytes of a plurality of analytes is less than about 500 nm. In some embodiments, said average minimum center-to-center distance between one or more analytes of a plurality of analytes is about 315 nm. In some embodiments, said average minimum center-to-center distance between one or more analytes of a plurality of analytes is about 250 nm.

Another aspect of the present disclosure comprises a system comprising a plurality of nucleic acid molecules adjacent to a surface, which plurality of nucleic acid molecules do not contact one another. In some embodiments, concatemers are loaded on the surface and closely packed to enable a center-to-center distance of about 250 nm with a variance of +/−25 nm. In some embodiments, the average center-to-center distance between molecules is about 315 nm. In some embodiments, the plurality of analytes (e.g., nucleic acid molecules) may be deposited adjacent to a surface such that adjacent analytes of the plurality of analytes may have average center-to-center spacings of at least 10 nanometers (nm), 50 nm, 100 nm, 110 nm, 120 nm, 130 nm, 140 nm, 150 nm, 160 nm, 170 nm, 180 nm, 190 nm, 200 nm, 210 nm, 220 nm, 230 nm, 240 nm, 250 nm, 260 nm, 270 nm, 280 nm, 290 nm, 300 nm, 310 nm, 320 nm, 330 nm, 340 nm, 350 nm, 360 nm, 370 nm, 380 nm, 390 nm, 400 nm, 410 nm, 420 nm, 430 nm, 440 nm, 450 nm, 460 nm, 470 nm, 480 nm, 490 nm, 500 nm, or more. The average center-to-center spacings may be less than or equal to 500 nm, 490 nm, 480 nm, 470 nm, 460 nm, 450 nm, 440 nm, 430 nm, 420 nm, 410 nm, 400 nm, 390 nm, 380 nm, 370 nm, 360 nm, 350 nm, 340 nm, 330 nm, 320 nm, 310 nm, 300 nm, 290 nm, 280 nm, 270 nm, 260 nm, 250 nm, 240 nm, 230 nm, 220 nm, 210 nm, 200 nm, 190 nm, 180 nm, 170 nm, 160 nm, 150 nm, 140 nm, 130 nm, 120 nm, 110 nm, 100 nm, 50 nm, or less. In some embodiments, said surface is unpatterned. In some embodiments, said surface is patterned. In some embodiments, said plurality of nucleic acid molecules are a plurality of concatemers. In some embodiments, adjacent nucleic acid molecules of said plurality of nucleic acid molecules have an average center-to-center spacing of less than about 500 nm.

Another aspect of the present disclosure comprises a method, comprising providing a plurality of nucleic acid molecules adjacent to a surface under conditions such that said plurality of nucleic acid molecules do not contact one another. In some embodiments, concatemers are loaded on the surface and closely packed to enable a center-to-center distance of ~250 nm with a variance of +/−25 nm. In some embodiments, the average center-to-center distance between molecules is about 315 nm. In some embodiments, the plurality of analytes (e.g., nucleic acid molecules) may be deposited adjacent to a surface such that adjacent analytes of the plurality of analytes may have average center-to-center spacings of at least 10 nanometers (nm), 50 nm, 100 nm, 110 nm, 120 nm, 130 nm, 140 nm, 150 nm, 160 nm, 170 nm, 180 nm, 190 nm, 200 nm, 210 nm, 220 nm, 230 nm, 240 nm, 250 nm, 260 nm, 270 nm, 280 nm, 290 nm, 300 nm, 310 nm, 320 nm, 330 nm, 340 nm, 350 nm, 360 nm, 370 nm, 380 nm, 390 nm, 400 nm, 410 nm, 420 nm, 430 nm, 440 nm, 450 nm, 460 nm, 470 nm, 480 nm, 490 nm, 500 nm, or more. The average center-to-center spacings may be less than or equal to 500 nm, 490 nm, 480 nm, 470 nm, 460 nm, 450 nm, 440 nm, 430 nm, 420 nm, 410 nm, 400 nm, 390 nm, 380 nm, 370 nm, 360 nm, 350 nm, 340 nm, 330 nm, 320 nm, 310 nm, 300 nm, 290 nm, 280 nm, 270 nm, 260 nm, 250 nm, 240 nm, 230 nm, 220 nm, 210 nm, 200 nm, 190 nm, 180 nm, 170 nm, 160 nm, 150 nm, 140 nm, 130 nm, 120 nm, 110 nm, 100 nm, 50 nm, or less. In some embodiments, said surface is unpatterned. In some embodiments, said surface is patterned. In some embodiments, said plurality of nucleic acid molecules are a plurality of concatemers. In some embodiments, adjacent nucleic acid molecules of said plurality of nucleic acid molecules have an average center-to-center spacing of less than about 500 nm.

Another aspect of the present disclosure provides a non-transitory computer readable medium comprising machine executable code that, upon execution by one or more computer processors, implements any of the methods above or elsewhere herein.

Another aspect of the present disclosure provides a system comprising one or more computer processors and computer memory coupled thereto. The computer memory comprises machine executable code that, upon execution by the one or more computer processors, implements any of the methods above or elsewhere herein.

Additional aspects and advantages of the present disclosure will become readily apparent to those skilled in this art from the following detailed description, wherein only illustrative embodiments of the present disclosure are shown and described. As will be realized, the present disclosure is capable of other and different embodiments, and its several details are capable of modifications in various obvious respects, all without departing from the disclosure. Accordingly, the drawings and description are to be regarded as illustrative in nature, and not as restrictive.

INCORPORATION BY REFERENCE

All publications, patents, and patent applications mentioned in this specification are herein incorporated by reference to the same extent as if each individual publication, patent, or patent application was specifically and individually indicated to be incorporated by reference. To the extent publications and patents or patent applications incorporated by reference contradict the disclosure contained in the specification, the specification is intended to supersede and/or take precedence over any such contradictory material.

BRIEF DESCRIPTION OF THE DRAWINGS

The novel features of the invention are set forth with particularity in the appended claims. A better understanding of the features and advantages of the present invention will be obtained by reference to the following detailed description that sets forth illustrative embodiments, in which the principles of the invention are utilized, and the accompanying drawings (also “Figure” and “FIG.” herein), of which:

FIG. 1 shows sequencer throughput versus array pitch and outlines a system design which meets the criteria needed for a $10 genome.

FIG. 2A shows a proposed embodiment of a high-density region of 80 nm diameter binding regions (spots) on a 240 nm pitch for low cost sequencing.

FIG. 2B is a comparison of the proposed substrate density with a sample effective density used for a $1,000 genome.

FIG. 3 shows crosstalk calculations for simulated detection of individual analytes on a 600 nm pitch processed with a 2× filter.

FIG. 4 shows Oversampled 2× (left) vs. Oversampled 4× and Deconvolved (right) simulations of images of detection of single analytes on a substrate at center-to-center distances of 400 nm, 300 nm, and 250 nm. A single image of Oversampled 4× and Deconvolved at a center-to-center distance of 200 nm is also shown.

FIG. 5 shows a plot of crosstalk between adjacent spots at different center-to-center distances between single analytes (array pitch (nm)) processed using Oversampled 2× vs. Oversampled 4× and Deconvolved simulations.

FIG. 6 depicts a flowchart for a method of determining the relative positions of analytes on a substrate with high accuracy, according to an embodiment of the present disclosure.

FIG. 7 depicts a flowchart for a method of identifying individual analytes from deconvolved optical signals detected from a substrate, according to an embodiment of the present disclosure.

FIG. 8 depicts a flowchart for a method of sequencing polynucleotides deposited on a substrate, according to an embodiment of the present disclosure.

FIG. 9 shows an overview of operations in an optical signal detection process from cycled detection, according to an embodiment of the present disclosure.

FIG. 10A shows a flowchart of operations for initial raw image analysis, according to an embodiment of the present disclosure.

FIG. 10B shows a flowchart of operations for location determination from optical signal peak information from a plurality of cycles, according to an embodiment of the present disclosure.

FIG. 10C shows a flowchart of operations for identification of overlapping optical signals from an image using accurate relative positional information and image deconvolution algorithms, according to an embodiment of the present disclosure.

FIG. 11 depicts a detailed flowchart of operations for an optical signal detection and deconvolution process for images from cycled detection of a densely-packed substrate, according to an embodiment of the present disclosure.

FIG. 12A shows a cross-talk plot of fluorophore intensity between four fluorophores from optical signals detected from the raw image.

FIG. 12B shows a cross-talk plot of fluorophore intensity between four fluorophores from a 4× oversampled image.

FIG. 13A shows a cross-talk plot of fluorophore intensity between four fluorophores from a 4× oversampled image without deconvolution or nearest neighbor correction.

FIG. 13B shows a cross-talk plot of fluorophore intensity between four fluorophores from a 4× oversampled and deconvolved image using a deconvolution algorithm with accurate analyte position information, according to an embodiment of the present disclosure.

FIG. 14A shows a simulated four-color composite of a raw image of a field at a center-to-center spacing between analytes of about 315 nm.

FIG. 14B shows a simulated four-color composite of a deconvolved image at a center-to-center spacing between analytes of about 315 nm.

FIG. 15A shows results of sequencing of a 1:1 mixture of synthetic oligonucleotide templates corresponding to the region around codon 790 in the EGFR gene containing equal amounts of mutant and wild type (WT) targets.

FIG. 15B depicts images from alternating base incorporation and cleavage cycles.

FIG. 16 is an image of single analytes deposited on a substrate and bound by a probe comprising a fluorophore.

FIG. 17, right panel, shows peaks from oversampled images of a field from each cycle overlaid from several analytes on a substrate (clusters of peaks). The left panel is the smoothed version of the right panel, recapitulating a Gaussian distribution of peaks from an analyte across a plurality of cycles with a highly accurate peak indicating relative positional information.

FIG. 18 shows localization variation for each of a plurality of molecules found in a field. The median localization variance is 5 nm and the 3 sigma localization variance is under 10 nm.

FIG. 19 shows a flowchart of deoxyribonucleic acid (DNA) library construction, circularization, and concatemer formation, according to an embodiment of the present disclosure.

FIG. 20 shows a flowchart of DNA library construction, circularization, and concatemer formation, including synthesis of ssDNA ‘hairs’ on the concatemer to facilitate exclusion for formation of a layer of concatemers, according to an embodiment of the present disclosure.

FIGS. 21A and 21B depict coated concatemers to facilitate exclusion from other concatemers in a layer of concatemers, according to an embodiment of the present disclosure.

FIG. 22 shows an embodiment of a closely-packed randomly distributed layer of concatemers, according to an embodiment of the present disclosure.

FIG. 23A shows a flow chart to form a library of circularized DNA comprising target sequences from a sample, according to an embodiment of the present disclosure.

FIG. 23B shows a flow chart to load concatemers on a layer on a substrate and to sequence the concatemers, according to an embodiment of the present disclosure.

FIG. 24 depicts an embodiment of the use of a unique molecule identifier to include source information (or other information) in each concatemer, according to an embodiment of the present disclosure.

FIGS. 25A-25C show images of concatemer layers distributed at high density on the surface of a substrate, according to some embodiments of the present disclosure. FIG. 25D depicts a graph of concatemer surface density, according to some embodiments of the present disclosure.

FIGS. 26A-26D depict images of concatemers bound to a substrate used for sequencing a concatemer target, showing successful resolution of sequences between adjacent nearby concatemers.

FIGS. 27A-27C show the results of sequencing by synthesis of E. coli using the methods and systems described herein. FIGS. 27A-27B show various base pair reads. FIG. 27C shows the resolution of base calling at individual spots for E. coli sequencing.

FIG. 28 shows a computer system that is programmed or otherwise configured to implement methods provided herein.

DETAILED DESCRIPTION

While various embodiments of the invention have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous variations, changes, and substitutions may occur to those skilled in the art without departing from the invention. It should be understood that various alternatives to the embodiments of the invention described herein may be employed.

Whenever the term “at least,” “greater than,” or “greater than or equal to” precedes the first numerical value in a series of two or more numerical values, the term “at least,” “greater than” or “greater than or equal to” applies to each of the numerical values in that series of numerical values. For example, greater than or equal to 1, 2, or 3 is equivalent to greater than or equal to 1, greater than or equal to 2, or greater than or equal to 3.

Whenever the term “no more than,” “less than,” or “less than or equal to” precedes the first numerical value in a series of two or more numerical values, the term “no more than,” “less than,” or “less than or equal to” applies to each of the numerical values in that series of numerical values. For example, less than or equal to 3, 2, or 1 is equivalent to less than or equal to 3, less than or equal to 2, or less than or equal to 1.

As used herein, the term “center-to-center distance” generally refers to a distance between two adjacent molecules as measured by the difference between the average position of each molecule on a substrate. The term “average minimum center-to-center distance” refers specifically to the average distance between the center of each analyte disposed on the substrate and the center of its nearest neighboring analyte, although the term center-to-center distance refers also to the minimum center-to-center distance in the context of limitations corresponding to the density of analytes on the substrate. As used herein, the term “pitch” or “average effective pitch” is generally used to refer to average minimum center-to-center distance. In the context of regular arrays of analytes, pitch may also be used to determine a center-to-center distance between adjacent molecules along a defined axis.

As used herein, the term “overlaying” (e.g., overlaying images) generally refers to overlaying images from different cycles to generate a distribution of detected optical signals (e.g., position and intensity, or position of peak) from each analyte over a plurality of cycles. This distribution of detected optical signals can be generated by overlaying images, overlaying artificial processed images, or overlaying datasets comprising positional information. Thus, as used herein, the term “overlaying images” generally encompasses any of these mechanisms to generate a distribution of position information for optical signals from a single probe bound to a single analyte for each of a plurality of cycles.
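By way of non-limiting illustration, overlaying datasets comprising positional information may be sketched as follows. The sketch assumes that a peak position for one analyte has already been extracted from each cycle and expressed in nanometers in a common coordinate frame. The function name `estimate_position`, the sample coordinates, and the use of a simple mean with its standard error are assumptions made for illustration and are not taken from the disclosure.

```python
import math
import statistics


def estimate_position(peaks):
    """Combine per-cycle peak positions (x, y) in nm for a single analyte.

    Returns (mean_x, mean_y, sem_x, sem_y), where sem_* is the standard
    error of the mean along each axis. Requires at least two cycles.
    """
    xs = [p[0] for p in peaks]
    ys = [p[1] for p in peaks]
    n = len(peaks)
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    sem_x = statistics.stdev(xs) / math.sqrt(n)
    sem_y = statistics.stdev(ys) / math.sqrt(n)
    return mean_x, mean_y, sem_x, sem_y


# Hypothetical peak positions (nm) for one analyte over four cycles.
peaks = [(100.0, 200.0), (104.0, 198.0), (96.0, 202.0), (100.0, 200.0)]
mean_x, mean_y, sem_x, sem_y = estimate_position(peaks)
print(f"position = ({mean_x:.1f}, {mean_y:.1f}) nm, "
      f"standard error = ({sem_x:.2f}, {sem_y:.2f}) nm")
```

Under these assumptions, the standard error of the mean shrinks roughly with the square root of the number of cycles, which is one way to see how repeated detection may tighten the relative position estimate for each analyte.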

A “cycle” is generally defined by completion of one or more passes and stripping of the detectable label from the substrate. Subsequent cycles of one or more passes per cycle can be performed. For the methods and systems described herein, multiple cycles are performed on a single substrate or sample. For deoxyribonucleic acid (DNA) sequencing, multiple cycles may require the use of a reversible terminator and a removable detectable label from an incorporated nucleotide. For proteins, multiple cycles may require that the probe removal (stripping) conditions either maintain proteins folded in their proper configuration, or that the probes used are chosen to bind to peptide sequences so that the binding efficiency is independent of the protein fold configuration.

A “pass” in a detection assay generally refers to a process where a plurality of probes comprising a detectable label are introduced to the bound analytes, selective binding occurs between the probes and distinct target analytes, and a plurality of signals are detected from the detectable labels. A pass includes introduction of a set of antibodies that bind specifically to a target analyte. A pass can also include introduction of a set of labelled nucleotides for incorporation into the growing strand during sequencing by synthesis. There can be multiple passes of different sets of probes before the substrate is stripped of all detectable labels, or before the detectable label or reversible terminator is removed from an incorporated nucleotide during sequencing. In general, if four nucleotides are used during a pass, a cycle may only include a single pass for standard four nucleotide sequencing by synthesis.

As used herein, an “image” generally refers to an image of a field taken during a cycle or a pass within a cycle. In some embodiments, a single image is limited to detection of a single color of a detectable label.

As used herein, the term “field” generally refers to a single region of a substrate that is imaged. During a typical assay a single field is imaged at least once per cycle. For example, for a 20 cycle assay, with 4 colors, there can be 20*4=80 images, all of the same field.

A “target analyte” or “analyte” generally refers to a molecule, compound, complex, substance or component that is to be identified, quantified, and otherwise characterized. A target analyte can comprise, by way of example and not limitation, a single molecule (of any molecular size), a single biomolecule, a polypeptide, a protein (folded or unfolded), a polynucleotide molecule (ribonucleic acid (RNA), complementary DNA (cDNA), or DNA), a fragment thereof, a modified molecule thereof, such as a modified nucleic acid, or a combination thereof. In an embodiment, a target polynucleotide comprises a hybridized primer to facilitate sequencing by synthesis. The target analytes are recognized by probes, which can be used to sequence, identify, and quantify the target analytes using optical detection methods described herein.

A “probe,” as used herein, generally refers to a molecule that is capable of binding to other molecules (e.g., a complementary labelled nucleotide during sequencing by synthesis, polynucleotides, polypeptides or full-length proteins, etc.), cellular components or structures (lipids, cell walls, etc.), or cells for detecting or assessing the properties of the molecules, cellular components or structures, or cells. The probe comprises a structure or component that binds to the target analyte. In some embodiments, multiple probes may recognize different parts of the same target analyte. Examples of probes include, but are not limited to, a labelled reversible terminator nucleotide, an aptamer, an antibody, a polypeptide, an oligonucleotide (DNA, RNA), or any combination thereof. Antibodies, aptamers, oligonucleotide sequences and combinations thereof as probes are also described in detail below.

The probe can comprise a detectable label that is used to detect the binding of the probe to a target analyte. The probe can be directly or indirectly bound to, hybridized to, conjugated to, or covalently linked to the target analyte.

As used herein, the term “detectable label” generally refers to a molecule bound to a probe that can generate a detectable optical signal when the probe is bound to a target analyte and imaged using an optical imaging system. The detectable label can be directly or indirectly bound to, hybridized to, conjugated to, or covalently linked to the probe. In some embodiments, the detectable label is a fluorescent molecule or a chemiluminescent molecule. The probe can be detected optically via the detectable label.

As used herein, the term “optical distribution model” generally refers to a statistical distribution of probabilities for light detection from a point source. These include, for example, a Gaussian distribution. The Gaussian distribution can be modified to include anticipated aberrations in detection to generate a point spread function as an optical distribution model.
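As a minimal sketch of a Gaussian optical distribution model, the following example evaluates a normalized two-dimensional Gaussian for a point source. The function names, the 250 nm full width at half maximum, and the sampling grid are hypothetical values chosen for illustration; no aberration terms are included.

```python
import math


def gaussian_psf(x, y, x0, y0, sigma):
    """Normalized 2D Gaussian density at (x, y) for a point source at (x0, y0)."""
    r2 = (x - x0) ** 2 + (y - y0) ** 2
    return math.exp(-r2 / (2.0 * sigma ** 2)) / (2.0 * math.pi * sigma ** 2)


def fwhm_to_sigma(fwhm):
    """Convert a full width at half maximum to a Gaussian standard deviation."""
    return fwhm / (2.0 * math.sqrt(2.0 * math.log(2.0)))


fwhm_nm = 250.0                      # hypothetical spot width
sigma_nm = fwhm_to_sigma(fwhm_nm)
step_nm = sigma_nm / 10.0            # sampling interval of the illustrative grid
half = 50                            # grid extends to +/- 5 sigma

# Riemann sum over the grid; a normalized model should integrate to about 1.
total = sum(
    gaussian_psf(i * step_nm, j * step_nm, 0.0, 0.0, sigma_nm)
    for i in range(-half, half + 1)
    for j in range(-half, half + 1)
) * step_nm ** 2

peak = gaussian_psf(0.0, 0.0, 0.0, 0.0, sigma_nm)
half_max = gaussian_psf(fwhm_nm / 2.0, 0.0, 0.0, 0.0, sigma_nm)
print(f"sigma = {sigma_nm:.1f} nm, integral = {total:.4f}, "
      f"half-maximum ratio = {half_max / peak:.3f}")
```

A model of this form can serve as the starting point to which anticipated aberrations are added.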

Provided herein are systems and methods that facilitate optical detection and discrimination of probes bound to tightly packed analytes bound to the surface of a substrate. In part, the methods and systems described herein rely on repeated detection of a plurality of target analytes on the surface of a substrate to improve the accuracy of identification of a relative location of each analyte on the substrate. This information can then be used to perform signal resolving on each image of a field of the substrate for each cycle to reliably identify a signal from a probe bound to the target analyte. In some embodiments, the resolving comprises deconvolution. In some embodiments, this type of deconvolution processing can be used to distinguish between different probes bound to the target analyte that have overlapping emission spectra when activated by an activating light. In some embodiments, the deconvolution processing can be used to separate optical signals from neighboring analytes. This is especially useful for substrates with analytes having a density wherein optical detection is challenging due to the diffraction limit of optical systems.

In some embodiments, the methods and systems described herein are particularly useful in sequencing. By providing methods and systems that facilitate reliable optical detection on densely packed substrates, costs associated with sequencing, such as reagents, number of clonal molecules used, processing and read time, can all be reduced to greatly advance sequencing technologies, specifically, sequencing by synthesis using optically detected nucleotides.

Although the systems and methods described herein have important implications for advancing sequencing technology, the methods and systems described herein are generally applicable to optical detection of analytes bound to the surface of a substrate, including on the single molecule level.

Sequencing Cost Reduction

Sequencing technologies include image-based systems developed by companies such as Illumina and Complete Genomics and electrical-based systems developed by companies such as Ion Torrent and Oxford Nanopore. Image-based sequencing systems currently have the lowest sequencing costs of all existing sequencing technologies. Image-based systems achieve low cost through the combination of high throughput imaging optics and low-cost consumables. However, prior art optical detection systems have minimum center-to-center spacing between adjacent resolvable molecules of about a micron, in part due to the diffraction limit of optical systems. In some embodiments, described herein are methods for attaining significantly lower costs for an image-based sequencing system using existing biochemistries using cycled detection, determination of precise positions of analytes, and use of the positional information for highly accurate deconvolution of imaged signals to accommodate increased packing densities below the diffraction limit.

Densely-Packed Analyte Layers and Detection Methods

Provided herein are systems and methods to facilitate imaging of signals from analytes deposited on a surface with a center-to-center spacing below the diffraction limit. These systems and methods use advanced imaging systems to generate super-resolution images, and cycled detection to facilitate positional determination of molecules on the substrate with high accuracy and resolving of images to obtain signal identity for each molecule on a densely packed surface with high accuracy. These methods and systems allow sequencing by synthesis on a densely packed substrate to provide highly efficient and very high throughput polynucleotide sequence determination with high accuracy.

The major cost components for sequencing systems are primarily the consumables which include biochip and reagents and secondarily the instrument costs. To reach a $10 30× genome, a 100-fold cost reduction, the amount of data per unit area needs to increase by 100-fold and the amount of reagent per data point needs to drop by 100-fold.

FIG. 1 shows sequencer throughput versus array pitch and outlines a system design which meets the criteria needed for a $10 genome. The basic idea is that to achieve a 100-fold cost reduction, the amount of data per unit area needs to increase by 100-fold and the amount of reagent per data point needs to drop by 100-fold. To achieve these reductions in costs, provided herein are methods and systems that facilitate reliable sequencing of polynucleotides deposited on the surface of a substrate at a density below the diffraction limit. These high densities allow for more efficient usage of reagents and increase the amount of data per unit area. In addition, the increase in the reliability of detection allows for a decrease in the number of clonal copies that may be synthesized to identify and correct errors in sequencing and detection, further reducing reagent costs and data processing costs.

High Density Distributions of Analytes on a Surface of a Substrate

FIG. 2A shows a proposed embodiment of a high-density region of 80 nm diameter binding regions (spots) on a 240 nm pitch. In this embodiment, an ordered array can be used where a single-stranded DNA molecule binds exclusively to specified regions on the chip. In some embodiments, concatemers (i.e., long continuous DNA molecules that contain multiple copies of the same DNA sequence linked in series) smaller than 40 kB are used so as to not overfill the spot. The size of the concatemers scales roughly with area, meaning the projected length of the smaller concatemer may be approximately 4 kB to 5 kB, resulting in approximately 10 copies if the same amplification process is used. It is also possible to use 4 kB lengths of DNA and sequence each concatemer directly. Another option is to bind a shorter segment of DNA with unsequenced filler DNA to bring the total length up to the size needed to create an exclusionary molecule.

FIG. 2B is a comparison of the proposed pitch with a sample effective pitch used for a $1,000 genome. The density of the new array is 170-fold higher, meeting the criterion of achieving 100-fold higher density. The number of copies per imaging spot per unit area also meets the criterion of being at least 100-fold lower than the prior existing platform. This helps ensure that the reagent costs are 100-fold more cost effective than baseline.
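The density comparison can be checked with a short calculation. The sketch below assumes one analyte per pitch-by-pitch square cell, and uses the 240 nm proposed pitch together with the 3,160 nm average effective pitch given earlier in this disclosure for ten million molecules per square centimeter. The function and variable names are illustrative only.

```python
NM_PER_CM = 1e7  # nanometers in one centimeter


def density_per_cm2(pitch_nm):
    """Analytes per square centimeter, assuming one analyte per pitch x pitch cell."""
    return (NM_PER_CM / pitch_nm) ** 2


baseline_pitch_nm = 3160.0   # average effective pitch of the example platform
proposed_pitch_nm = 240.0    # pitch of the proposed high-density array

baseline_density = density_per_cm2(baseline_pitch_nm)
proposed_density = density_per_cm2(proposed_pitch_nm)
fold_increase = proposed_density / baseline_density

print(f"baseline: {baseline_density:.3g} per cm^2")
print(f"proposed: {proposed_density:.3g} per cm^2")
print(f"fold increase: {fold_increase:.0f}")
```

Because density scales with the inverse square of pitch under this assumption, the fold increase is simply the square of the pitch ratio, which is consistent with the roughly 170-fold figure stated above.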

Imaging Densely Packed Single Biomolecules and the Diffraction Limit

One constraint for increased molecular density for an imaging platform is the diffraction limit. The equation for the diffraction limit of an optical system is:

D = λ/(2·NA)

where D is the diffraction limit, λ is the wavelength of light, and NA is the numerical aperture of the optical system. Typical air imaging systems have NAs of 1.0 to 1.2. Using λ = 600 nm, the diffraction limit is between 250 nm and 300 nm. For a water immersion system, the NA is ~1.0, giving a diffraction limit of 300 nm.

If features on an array or other substrate surface comprising biomolecules are too close, two optical signals may overlap substantially so that they appear as a single blob that cannot be reliably resolved based on the image alone. This can be exacerbated by errors introduced by the optical imaging system, such as blur due to inaccurate tracking of a moving substrate, or optical variations in the light path between the sensor and the surface of a substrate.

The transmitted light or fluorescence emission wavefronts emanating from a point in the specimen plane of the microscope become diffracted at the edges of the objective aperture, effectively spreading the wavefronts to produce an image of the point source that is broadened into a diffraction pattern having a central disk of finite, but larger size than the original point. Therefore, due to diffraction of light, the image of a specimen never perfectly represents the real details present in the specimen because there is a lower limit below which the microscope optical system cannot resolve structural details.

The observation of sub-wavelength structures with microscopes is difficult because of the diffraction limit. A point object in a microscope, such as a fluorescent protein or polynucleotide, may generate an image at the intermediate plane that may include a diffraction pattern created by the action of interference. When highly magnified, the diffraction pattern of the point object may be observed to include a central spot (diffraction disk) surrounded by a series of diffraction rings. Combined, this point source diffraction pattern is referred to as an Airy disk.

The size of the central spot in the Airy pattern is related to the wavelength of light and the aperture angle of the objective. For a microscope objective, the aperture angle is described by the numerical aperture (NA), which includes the term sin(θ), where θ is the half angle over which the objective can gather light from the specimen. In terms of resolution, the radius of the diffraction Airy disk in the lateral (x,y) image plane is defined by the following formula: Abbe Resolution = λ/(2·NA), where λ is the average wavelength of illumination in transmitted light or the excitation wavelength band in fluorescence. The objective numerical aperture (NA=n·sin(θ)) is defined by the refractive index of the imaging medium (n; usually air, water, glycerin, or oil) multiplied by the sine of the aperture angle (sin(θ)). As a result of this relationship, the size of the spot created by a point source decreases with decreasing wavelength and increasing numerical aperture, but always remains a disk of finite diameter. The Abbe resolution (i.e., Abbe limit) is also referred to herein as the diffraction limit and defines the resolution limit of the optical system.

If the distance between the two Airy disks or point-spread functions is greater than the diffraction limit, the two point sources are considered to be resolved (and can readily be distinguished). Otherwise, the Airy disks merge together and are considered not to be resolved.
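As an illustrative sketch, the Abbe formula and the resolvability criterion just described can be expressed in a few lines. The function names are hypothetical; the wavelength and numerical aperture are example values used in this section, and the two center-to-center distances are the 240 nm and 3,160 nm pitches discussed in this disclosure.

```python
def diffraction_limit_nm(wavelength_nm, numerical_aperture):
    """Abbe diffraction limit: wavelength / (2 * NA), in nanometers."""
    return wavelength_nm / (2.0 * numerical_aperture)


def is_resolved(distance_nm, wavelength_nm, numerical_aperture):
    """True when two point sources are farther apart than the diffraction limit."""
    return distance_nm > diffraction_limit_nm(wavelength_nm, numerical_aperture)


wavelength_nm = 600.0
numerical_aperture = 1.0
limit_nm = diffraction_limit_nm(wavelength_nm, numerical_aperture)
print(f"diffraction limit: {limit_nm:.0f} nm")

for pitch_nm in (3160.0, 240.0):
    resolved = is_resolved(pitch_nm, wavelength_nm, numerical_aperture)
    print(f"pitch {pitch_nm:.0f} nm resolved by the image alone: {resolved}")
```

Under this simple criterion, analytes on a 240 nm pitch fall below the limit for these example optics, which is the regime the cycled detection and deconvolution methods described herein are intended to address.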

Thus, light emitted from a detectable label point source with wavelength λ, traveling in a medium with refractive index n and converging to a spot with half-angle θ may make a diffraction limited spot with a diameter: d = λ/(2·NA). Considering green light around 500 nm and a NA (Numerical Aperture) of 1, the diffraction limit is roughly d = λ/2 = 250 nm (0.25 μm), which limits the density of analytes such as proteins, nucleotides and other sequencing substrates (e.g., as shown in FIG. 20) on a surface able to be imaged by conventional imaging techniques. As used herein, sequencing substrates include any analyte that sequence information can be derived from, such as a template for a sequencing reaction. Even in cases where an optical microscope is equipped with the highest available quality of lens elements, is perfectly aligned, and has the highest numerical aperture, the resolution remains limited to approximately half the wavelength of light in the best-case scenario. To increase the resolution, shorter wavelengths can be used, as in UV and X-ray microscopes. These techniques offer better resolution but are expensive, suffer from lack of contrast in biological samples and may damage the sample.

Image Resolving

In some embodiments, the image resolving methods described herein comprise deconvolution. Deconvolution is an algorithm-based process used to reverse the effects of convolution on recorded data. The concept of deconvolution is widely used in the techniques of signal processing and image processing. Because these techniques are in turn widely used in many scientific and engineering disciplines, deconvolution finds many applications.

In optics and imaging, the term “deconvolution” is specifically used to refer to the process of reversing the optical distortion that takes place in an optical microscope, electron microscope, telescope, or other imaging instrument, thus creating clearer images. It is usually done in the digital domain by a software algorithm, as part of a suite of microscope image processing techniques.

The usual method is to assume that the optical path through the instrument is optically perfect, convolved with a point spread function (PSF), that is, a mathematical function that describes the distortion in terms of the pathway a theoretical point source of light (or other waves) takes through the instrument. Usually, such a point source contributes a small area of fuzziness to the final image. If this function can be determined, it is then a matter of computing its inverse or complementary function, and convolving the acquired image with that. Deconvolution maps to division in the Fourier co-domain. This allows deconvolution to be easily applied with experimental data that are subject to a Fourier transform. An example is NMR spectroscopy where the data are recorded in the time domain, but analyzed in the frequency domain. Division of the time-domain data by an exponential function has the effect of reducing the width of Lorentzian lines in the frequency domain. The result is the original, undistorted image.

However, for diffraction limited imaging, deconvolution is also needed to further refine the signals to improve resolution beyond the diffraction limit, even if the point spread function is perfectly known. It is very hard to separate two objects reliably at distances smaller than the Nyquist distance. However, described herein are methods and systems using cycled detection, analyte position determination, alignment, and deconvolution to reliably detect objects separated by distances much smaller than the Nyquist distance.
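The statement that deconvolution maps to division in the Fourier domain can be illustrated with a one-dimensional toy example. The sketch below is not the deconvolution algorithm of the present disclosure: the signal, the three-tap blur kernel, the naive discrete Fourier transform, and the small regularization constant `eps` (added so that division remains stable where the kernel's transform is small) are all assumptions made for illustration.

```python
import cmath


def dft(signal, inverse=False):
    """Naive discrete Fourier transform (O(n^2)), adequate for short signals."""
    n = len(signal)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        total = 0j
        for t, value in enumerate(signal):
            total += value * cmath.exp(sign * 2j * cmath.pi * k * t / n)
        out.append(total / n if inverse else total)
    return out


def circular_convolve(signal, kernel):
    """Circular convolution of two equal-length sequences."""
    n = len(signal)
    return [sum(signal[(i - j) % n] * kernel[j] for j in range(n)) for i in range(n)]


def deconvolve(blurred, kernel, eps=1e-6):
    """Regularized Fourier division: F = G * conj(H) / (|H|^2 + eps)."""
    g_hat = dft(blurred)
    h_hat = dft(kernel)
    f_hat = [g * h.conjugate() / (abs(h) ** 2 + eps) for g, h in zip(g_hat, h_hat)]
    return [value.real for value in dft(f_hat, inverse=True)]


# Two hypothetical point sources two samples apart.
truth = [0.0] * 16
truth[6], truth[8] = 1.0, 0.8

# Hypothetical symmetric blur kernel centered on index 0 (wraps around).
kernel = [0.0] * 16
kernel[0], kernel[1], kernel[-1] = 0.6, 0.2, 0.2

blurred = circular_convolve(truth, kernel)
restored = deconvolve(blurred, kernel)
print([round(v, 3) for v in blurred[5:10]])
print([round(v, 3) for v in restored[5:10]])
```

In this noise-free toy case the division recovers the original samples closely. With real, noisy images, plain division amplifies noise at frequencies where the kernel's transform is small, which is one reason additional information such as accurate analyte positions is valuable.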

Making High Density Random Layers of Concatemers for Sequencing

Also provided herein are methods of making and using high densityconcatemer layers. In some embodiments, the concatemers are randomlydistributed on a surface of a substrate in a close-packed layer forindividual detection and sequencing. In some embodiments, providedherein are methods of making and randomly distributing a layer ofconcatemers on a substrate such that they achieve a high density oraverage center-to-center distance.

Concatemers (i.e., CATs) are long single-stranded DNA molecules made through rolling circle amplification (RCA) of a single-stranded circular DNA. In some embodiments, the concatemers each comprise from a few up to several hundred copies of a target DNA sequence inserted between known sequence adapters. A library of concatemers comprising target DNA sequences can be generated. In some embodiments, the concatemers comprise features that self-exclude to facilitate layering a close-packed single layer of concatemers on a substrate with minimal overlap or a minimum distance between adjacent concatemers and without needing specific attachment points on the substrate. These exclusionary features facilitate close-packed layers while minimizing the number of nearest neighbor concatemers that are too close to be resolved by optical imaging, as described herein.

In some embodiments, provided herein are substrates comprising a surface, wherein the surface is bound to a close-packed, randomly distributed collection of amplified targets, such as DNA concatemers.

In some embodiments, this substrate is used to facilitate nucleotide sequencing, including of whole genomes or exomes. In some embodiments, large numbers of individual cellular targets can be sequenced. These can represent a selected panel of targets using cluster sequencing. Sequencing as described herein can be used, for example, to (i) detect multiple genetic variants (e.g., for genotyping, drug resistance determination, paternity, or identification), (ii) sequence multiple cDNA molecules for gene expression analysis for enumeration of pathway dynamics, or (iii) detect methylated residues on a target polynucleotide following bisulfite treatment. In some embodiments, sequencing methods require target amplification to generate small clusters of ~200 target copies as described in the embodiments.

The method, in one embodiment, comprises: the creation of circularized single stranded molecules for targets across the genome using ligase reactions, amplification of the circularized DNA using isothermal whole genome amplification methods to generate clusters of circularized amplified targets (CAT) that have a few hundred copies, and ensuring that the CATs are coated with appropriate reagents to generate nanospheres that have a uniform size around 250 nm with a distribution around 225-275 nm.

The method, in one embodiment, further comprises: distributing the CATs on a bio-chip in a densely packed collection and attaching them to the surface with removal of the coating materials, and ensuring that the CATs remain bound to the slide through multiple cycles of sequencing reactions.

In some embodiments, the target biomolecules are detected and/or sequenced and authenticated based on repeat hybridizations. This facilitates improved accuracy, including an increase in sensitivity and/or specificity, to provide improved target identification and/or sequencing.

In some embodiments, single base extension assays and oligonucleotide ligation assays are performed at single molecule levels to provide authentication. This level of authentication allows very high multiplexing and digital counting to quantify relative and absolute abundance with an accuracy previously unavailable via optical imaging.

Sequencing

Optical detection imaging systems are diffraction-limited, and thus have a theoretical maximum resolution of ~300 nm with fluorophores typically used in sequencing. To date, the best sequencing systems have had center-to-center spacings between adjacent polynucleotides of ~600 nm on their arrays, or ~2× the diffraction limit. This factor of 2× is needed to account for intensity, array, and biology variations that can result in errors in position. To achieve a $10 genome, an approximately 200 nm center-to-center spacing is required, which requires sub-diffraction-limited imaging capability.
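As a rough illustration of how center-to-center spacing relates to areal density, the following sketch converts between the two. It assumes a regular square grid, which is a simplification: the randomly distributed layers described herein do not have a single fixed pitch, so the numbers are indicative only. The function names are hypothetical.

```python
import math

NM_PER_CM = 1e7  # nanometers in one centimeter

def density_from_pitch(pitch_nm):
    """Analytes per square centimeter for a square grid of the given pitch (assumed layout)."""
    return (NM_PER_CM / pitch_nm) ** 2

def pitch_from_density(density_per_cm2):
    """Average square-grid pitch in nm for a given areal density."""
    return NM_PER_CM / math.sqrt(density_per_cm2)

# Compare the ~600 nm spacing of existing systems with the ~200 nm target.
gain = density_from_pitch(200) / density_from_pitch(600)

for pitch in (600, 300, 200):
    print(f"{pitch} nm pitch: {density_from_pitch(pitch):.2e} analytes per cm^2")
```

Under this assumption, density scales with the inverse square of the pitch, so reducing the spacing from about 600 nm to about 200 nm corresponds to roughly a nine-fold increase in analytes per unit area.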

For sequencing, the purpose of the systems and methods described herein is to resolve polynucleotides that are sequenced on a substrate with a center-to-center spacing below the diffraction limit of the optical system.

As described herein, we provide methods and systems to achieve sub-diffraction-limited imaging in part by identifying a position of each analyte with a high accuracy (e.g., 10 nm RMS or less). By comparison, state of the art super-resolution systems can only identify location with an accuracy down to 20 nm RMS, 2× worse than this system. Thus, the methods and systems disclosed herein enable sub-diffraction-limited imaging to identify densely-packed molecules on a substrate to achieve a high data rate per unit of enzyme, a high data rate per unit of time, and high data accuracy. These sub-diffraction-limited imaging techniques are broadly applicable to techniques using cycled detection as described herein.

Multiple Cycles of Sequencing Concatemers Methods of Making CATs

Creation of Circularized ssDNA Targets

In some embodiments, described herein are methods of preparing a library of concatemers to distribute as a layer onto the surface of a substrate, e.g., as a randomly distributed, densely packed layer. To synthesize concatemers comprising target DNA to be sequenced, first, target DNA can be amplified and converted into circular DNA templates. In some embodiments, amplification products undergo circular template ligation, which can be conducted via template-mediated enzymatic ligation (e.g., T4 DNA ligase) or template-free ligation using special DNA ligases (e.g., CircLigase) to form a precursor to the concatemers formed via rolling circle amplification of the circular DNA templates.

RCA/RCR Basic Technique

Rolling circle replication describes a process of unidirectional nucleic acid replication that can rapidly synthesize multiple copies of circular molecules of DNA or RNA.

RCA (rolling circle amplification) is an isothermal nucleic acid amplification technique where the polymerase continuously adds single nucleotides to a primer annealed to a circular template, which results in a long concatemer ssDNA that contains tens to hundreds of tandem repeats (complementary to the circular template).

Rolling circle amplification can be performed by exposing the circular DNA templates to: (1) a DNA polymerase; (2) a suitable buffer that is compatible with the polymerase; (3) a short DNA or RNA primer; and (4) deoxynucleotide triphosphates (dNTPs).

In some embodiments, the polymerase used in rolling circle amplification is Phi29, Bst, or Vent exo- DNA polymerase for DNA amplification, and T7 RNA polymerase for RNA amplification. RCA can be conducted at a constant temperature (room temperature to 37° C.) in both free solution and on top of deposited targets (solid phase amplification). A DNA RCA reaction typically proceeds via primer-induced single-strand DNA elongation.

In some embodiments, a method for constructing concatemer libraries of sequencing substrates to load onto a physical substrate, such as a flow cell, is shown in FIG. 19. In some embodiments, concatemer libraries of sequencing substrates are constructed as shown in FIG. 20. ‘Hairs’ are ssDNA molecules that can be generated by using a reverse primer to synthesize in the opposite direction as the extending concatemer DNA. These ‘hairs’ can be used to control the size and/or exclusion properties of the concatemers. In some embodiments, the sequencing reaction described herein occurs using the ssDNA ‘hairs’ as templates.

Terminating RCR Reaction

The rolling circle amplification of the CAT can be stopped by the addition of EDTA to chelate the essential Mg2+ cofactor of the phi29 enzyme. Phi29 is a strongly displacing polymerase, while the standard polymerases used for sequencing, for example Therminator 9, are only weakly displacing. A more displacing enzyme for sequencing this substrate may be used or adapted.

Alternatively, one may use single strand binding proteins (SSBs) or helicases, or combinations of them, to aid in the displacement. These may be added to the extension reaction or used as pre-incubation operations to prepare the substrate for sequencing.

Alternatively, the rolling circle reaction may be stopped using an unlabeled reversible terminator. This may be a way to make the stoppage more uniform within the solution, yielding more uniform-sized CATs than stoppage with EDTA. Additionally, the sequencing reaction may then be initiated from the unblocking operation, followed by extension with labeled reversible terminator nucleotides. This may allow for the natural selection of substrates in which the extending 3′ end was accessible for the normal reactions of sequencing by synthesis.

The phi29 is likely very tightly bound to the extending end of the CAT. The use of a reversible terminator to stop the reaction may destabilize that interaction. Other protein denaturants like chaotropic salts or detergents may be necessary to displace the phi29 to enable the sequencing reaction.

Concatemer Composition

The CATs have several identical copies of the target DNA on the extending single strand. CATs can also have several identical reverse copies of the target DNA on ssDNA ‘hairs’ generated as described above.

In some embodiments, concatemers are at least 1,000 nucleotides in length (and no more than 400,000 nucleotides).

In some embodiments, concatemers are at least 150 nm in diameter (and no more than 300 nm). Preferably, the exclusion zone between adjacent concatemers is not less than the minimum center-to-center distance necessary to achieve the desired density or pitch.

Densely-Packed Random Arrays

Methods of Making Arrays (Randomly Distributed Close Packed Layer of Concatemers)

Controlled Spacing

Provided herein are several mechanisms to control the distribution of minimum center-to-center distance between CATs arrayed on an un-patterned surface. In some embodiments, these methods and compositions facilitate formation of a uniform, close-packed self-assembled random layer of CATs with a controlled minimum center-to-center distance between adjacent CATs such that they can be sequenced with minimal cross-talk between the dye-labeled sequencing substrates.

The CATs themselves are mutually repellent in solution due to their strong negative charge, but they may nonetheless be too close to each other for effective diffraction-limited resolution of labeled adjacent CATs once adsorbed to a surface.

In some embodiments, the concatemers are ‘encased’ or ‘enveloped’ in a shell of a repellent or attractive substance to increase their effective exclusion size without altering the size of the CAT itself or the number of copies of the sequencing substrate they contain.

In some embodiments, a protein layer to which the CATs adsorb on the surface of the substrate is modified to space the interacting proteins out on the surface. For example, the CATs can interact with the glass, silicon, or modified (e.g., amino-silanated) surface through an interaction with proteins that have been previously adsorbed to the surface.

Thus, modifications of the CAT or the protein partner of the binding pair can assist in size exclusion to achieve a uniform, densely-packed layer of concatemers on a surface without specific attachment points for the CATs. In some embodiments, these modifications include crosslinking or attaching molecules like PEG or polysaccharide to coat the CAT or its protein binding partner.

Shown in FIGS. 21A and 21B is an embodiment depicting coated concatemers.

The inner core in this embodiment may be multiple copies of a DNA target that are entwined. The outer layer, i.e., the coating, can include compounds like PEG, compounds with zwitterionic features, ampholine ampholytes, sulphobetaine, and other similar molecules with the positive charges interacting with nucleic acid on the inside and negative charges on the outside to ensure the nanospheres do not clump.

Loading of CATs on the Chip

In some embodiments, concatemers are distributed onto an unpatterned surface of a substrate in a high density layer. This close-packed formation facilitates formation of tightly packed sequencing substrates to enable higher throughput and/or lower cost sequencing. In some embodiments, said surface is patterned. An example of a densely packed concatemer layer on an unpatterned surface is shown in FIG. 25.

In some embodiments, concatemers are loaded on a biochip and closely packed to enable a center-to-center distance of ~250 nm with a variance of +/−25 nm.

In some embodiments, the average center-to-center distance between molecules is about 315 nm. In some embodiments, the plurality of analytes (e.g., nucleic acid molecules) may be deposited adjacent to a surface such that adjacent analytes of the plurality of analytes may have average center-to-center spacings of at least 10 nanometers (nm), 50 nm, 100 nm, 110 nm, 120 nm, 130 nm, 140 nm, 150 nm, 160 nm, 170 nm, 180 nm, 190 nm, 200 nm, 210 nm, 220 nm, 230 nm, 240 nm, 250 nm, 260 nm, 270 nm, 280 nm, 290 nm, 300 nm, 310 nm, 320 nm, 330 nm, 340 nm, 350 nm, 360 nm, 370 nm, 380 nm, 390 nm, 400 nm, 410 nm, 420 nm, 430 nm, 440 nm, 450 nm, 460 nm, 470 nm, 480 nm, 490 nm, 500 nm, or more. The average center-to-center spacings may be less than or equal to 500 nm, 490 nm, 480 nm, 470 nm, 460 nm, 450 nm, 440 nm, 430 nm, 420 nm, 410 nm, 400 nm, 390 nm, 380 nm, 370 nm, 360 nm, 350 nm, 340 nm, 330 nm, 320 nm, 310 nm, 300 nm, 290 nm, 280 nm, 270 nm, 260 nm, 250 nm, 240 nm, 230 nm, 220 nm, 210 nm, 200 nm, 190 nm, 180 nm, 170 nm, 160 nm, 150 nm, 140 nm, 130 nm, 120 nm, 110 nm, 100 nm, 50 nm, or less.

In some embodiments, the concatemers comprise a coating to achieve a lower threshold of center-to-center distances between adjacent concatemers to minimize crosstalk during detection. In some embodiments, after binding the concatemers to the surface, the coating is dissolved, and the CATs are attached to the surface and can be sequenced.

Another protein such as BSA may be used, either by chemically crosslinking to the CAT or the protein binding partner, or by attaching the spacer protein (e.g., BSA) to an oligonucleotide complementary to the common library adapter sequence through streptavidin interaction. Using BSA to coat the CAT may have the additional benefit of making a protein gel in the bound layer of CATs, which may make the local environment for the enzymatic reaction more similar to the natural environment of the nucleus where polymerases normally act.

One may also be able to hybridize long single stranded oligonucleotides that are partially complementary to the common library adapter sequence and extend beyond that sequence without homology. In some embodiments, the long single stranded oligonucleotides are the hairs mentioned above in Paragraph [00113]. Such long oligonucleotides may act to increase the size of the CAT without altering the number of sequencing substrates it contains. After surface attachment, these long oligonucleotides may be washed away, and each CAT may collapse towards the center of its attachment site, increasing the effective center-to-center distance between adjacent CATs.

DNA may also be used to modify the protein binding partner (by crosslinking or attachments such as streptavidin) to create a surface that has attractive protein binding sites separated by repellent areas, for instance due to their negative charge.

Deposition of a Closely-Packed Concatemer Layer onto an Unpatterned Surface

One of the limitations to optimum packing density of biological analytes from an aqueous solution onto an un-patterned, adherent solid surface is that the random binding of the analytes onto the surface does not provide for maximal close-packing due to the inability of the adhered analyte to move laterally and minimize spacing between bound molecules. As a result, this random irreversible sticking of analytes produces spacing defects in what may otherwise be arranged into a maximally close-packed array.

However, many biological analytes, including proteins and nucleic acids, are known to be surface active and migrate to the air-water interface, which results in a lowering of the surface tension at that interface and produces a metastable monolayer of biomolecules. In this case, the surface-active analytes are free to move laterally at the interface and achieve a maximal close-packed density, with unfavorable hydrophobic interactions in solution being the driving force for maximal packing.

Therefore, in some embodiments, close-packed, spontaneously formed monolayer constructs of biomolecules at the air-water interface can be transferred or deposited onto a solid surface by pulling or dragging a bolus of the biomolecule solution across the solid surface that is already in contact with air. Thereby, the close-packed biomolecule construct at the air-water interface is deposited onto the solid surface from the point of three-phase (air-water-solid) contact as the bolus moves across the solid surface.

In some embodiments, a protein layer may be laid down on the surface before the CATs are added. Then the CATs may be added to the already laid down protein layer. This sequential addition may be particularly effective if the binding protein is the modified partner.

Sequencing

Sequencing Work Flow

In some embodiments, provided herein are methods to detect the sequences of polynucleotides from the concatemers, e.g., through forming a densely-packed layer on an unpatterned surface and performing cycled sequencing by synthesis (see, e.g., FIG. 23). In some embodiments, said surface is patterned.

The detection of targets and their authentication based on repeat hybridizations is a key feature enabling target identification and counting for quantification.

Syncing and Signal Calling (ddNTP Capping of Unreacted Oligonucleotides)

In some embodiments, the sequencing by synthesis includes addition of an irreversible ddNTP terminator after an extension cycle to cap unextended oligonucleotides. For example, after getting maximal initiation and/or extension with a mixture of labeled and cold reversible terminators, a cycle of extension can be performed (e.g., with a different polymerase that can better incorporate ddNTPs) using very high concentrations of all four ddNTPs. This operation may irreversibly terminate the extension of any sequencing template within a CAT that failed to extend at the cycle in question. Although this may lead to progressive loss of signal, proportional to the inefficiency of initiation or extension, importantly it may also reduce background at subsequent cycles from those templates within the CAT that ‘skipped’ extension at any cycle, a process which results in mixed signal from lagging synthesis on some of the identical templates within the CAT.

This process may lead to increased synchronization of templates within a CAT, yielding less signal from lagging templates, and so purer signal from the correct base in the sequence. All other things being equal, it may lead to longer effective sequence reads.
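The trade-off between signal loss and signal purity described above can be illustrated with a deliberately simple model. The model is an assumption made for this sketch and is not part of the disclosure: every template in a CAT extends independently with the same per-cycle efficiency `p`, only templates that extended in the current cycle carry a label, and ddNTP capping, when used, terminates every template that missed a cycle so that it stays dark thereafter.

```python
def cycle_signal(p, n, capped):
    """Return (in_phase, lagging) labeled fractions of templates at cycle n.

    p      -- assumed per-cycle extension efficiency (0 < p <= 1)
    n      -- cycle number, starting at 1
    capped -- True if unextended templates are ddNTP-capped after each cycle
    """
    in_phase = p ** n                      # extended in every cycle so far
    # Without capping, any template that extends this cycle is labeled,
    # including those that skipped an earlier cycle and now report the wrong base.
    lagging = 0.0 if capped else p - in_phase
    return in_phase, lagging

def purity(p, n, capped):
    """Fraction of the labeled signal that comes from in-phase templates."""
    in_phase, lagging = cycle_signal(p, n, capped)
    return in_phase / (in_phase + lagging)

# Hypothetical efficiency of 98% per cycle, evaluated at cycle 50.
for capped in (False, True):
    signal, background = cycle_signal(0.98, 50, capped)
    print(f"capped={capped}: in-phase={signal:.3f}, lagging={background:.3f}, "
          f"purity={purity(0.98, 50, capped):.3f}")
```

In this model the in-phase signal decays as p to the power n with or without capping, while capping removes the lagging contribution entirely. Real reactions would fall between these idealized cases because capping itself is not perfectly efficient.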

Reaction

The CATs have several identical copies of the target DNA, but the last copy made during rolling circle amplification is unique in that it contains an actively extending 3′ end. This ssCircle and its actively extending end are likely to be near the center of the ball of DNA that is the CAT, so it is near the center of the exclusion zone within the monolayer of CATs. It is also away from the surface on which that monolayer is formed. Raising the actively extending end away from the surface may increase the accessibility for the chemicals and enzymes used in the sequencing reaction, and also perhaps raise the dye labels above the focal plane of background fluorescence on the surface. These properties make it ideal for single-molecule sequencing.

Paired End Sequencing UMI Embodiment

Unique molecular identifiers (UMIs) have been used to tag molecules to enable identification of duplicate PCR products and also to enable double stranded sequencing applications that reduce error.

In some embodiments, adapters that contain UMIs are incorporated into the circularized DNA template used to form the concatemer.

In one embodiment, UMI A1 and A2 adaptors are added to the 5′ and 3′ ends of Strand A and B, as shown in FIG. 24. A1 and A2 can have barcodes for sample ID. They also have regions used for ligation/circle generation and sequencing primer binding regions to enable sequencing both strands. The adaptors may also have the UMI sequences.

After the completion of sequencing, the UMIs can be used to locate circles emanating from the same DNA fragment, which can then be analyzed as paired end reads. Paired end reads are useful for mapping if the read lengths are short.

Although UMIs may be used, many applications, such as NIPT, PCR amplified panels, and large portions of the genome, can be reliably sequenced without having paired end capability.

Imaging and Cycled Detection

As described herein, each of the detection methods and systems requires cycled detection to achieve sub-diffraction-limited imaging. Cycled detection includes the binding and imaging of probes, such as antibodies or nucleotides, bound to detectable labels that can emit a visible light optical signal. By using positional information from a series of images of a field from different cycles, deconvolution to resolve signals from densely packed substrates can be used effectively to identify individual optical signals from signals obscured due to the diffraction limit of optical imaging. After multiple cycles, the determined location of the molecule may become increasingly accurate. Using this information, additional calculations can be performed to aid in crosstalk correction regarding known asymmetries in the crosstalk matrix occurring due to pixel discretization effects.
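One way to see why location estimates can improve over multiple cycles is to average the per-cycle position estimates of an analyte. The sketch below assumes that images have already been aligned, that each cycle yields an independent centroid estimate with the same RMS error, and that the estimates are combined by a plain mean. These are assumptions for illustration; the disclosure does not state that positions are combined in this particular way, and the coordinates shown are hypothetical.

```python
import math

def refined_position(estimates):
    """Average per-cycle (x, y) centroid estimates of one analyte, in nm."""
    n = len(estimates)
    return (sum(x for x, _ in estimates) / n,
            sum(y for _, y in estimates) / n)

def expected_rms(single_cycle_rms_nm, n_cycles):
    """RMS error of the mean of n independent, equally noisy estimates."""
    return single_cycle_rms_nm / math.sqrt(n_cycles)

def cycles_needed(single_cycle_rms_nm, target_rms_nm):
    """Smallest number of cycles for which expected_rms meets the target."""
    return math.ceil((single_cycle_rms_nm / target_rms_nm) ** 2)

# Hypothetical centroid estimates of one analyte from three aligned cycles.
position = refined_position([(100.0, 200.0), (104.0, 198.0), (102.0, 202.0)])

# Cycles required to go from 20 nm RMS per image to 10 nm RMS overall.
n = cycles_needed(20.0, 10.0)
```

Under these assumptions the error falls with the square root of the number of cycles, so halving the RMS error takes four cycles. Correlated errors, such as residual misalignment between cycles, would reduce this gain.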

Methods for Optical Detection of Analytes

In some embodiments, optical signals are digitized, and analytes are identified based on a code (ID code) of digital signals for each analyte.

As described herein, analytes are deposited onto a solid substrate, and probes are bound to the analytes. Each of the probes comprises tags and specifically binds to a target analyte. In some embodiments, the tags are fluorescent molecules that emit the same fluorescent color, and the signals for additional fluors are detected at each subsequent pass. During a pass, a set of probes comprising tags are contacted with the substrate, allowing them to bind to their targets. An image of the substrate is captured, and the detectable signals are analyzed from the image obtained after each pass. The information about the presence and/or absence of detectable signals is recorded for each detected position (e.g., target analyte) on the substrate.

In some embodiments, the present disclosure comprises methods that include operations for detecting optical signals emitted from the probes comprising tags, counting the signals emitted during multiple passes and/or multiple cycles at various positions on the substrate, and analyzing the signals as digital information using a K-bit based calculation to identify each target analyte on the substrate. Error correction can be used to account for errors in the optically-detected signals, as described below.

In some embodiments, a substrate is bound with analytes comprising N target analytes. To detect N target analytes, M cycles of probe binding and signal detection are chosen. Each of the M cycles includes 1 or more passes, and each pass includes N sets of probes, such that each set of probes specifically binds to one of the N target analytes. In certain embodiments, there are N sets of probes for the N target analytes.

In each cycle, there is a predetermined order for introducing the sets of probes for each pass. In some embodiments, the predetermined order for the sets of probes is a randomized order. In other embodiments, the predetermined order for the sets of probes is a non-randomized order. In one embodiment, the non-random order can be chosen by a computer processor. The predetermined order is represented in a key for each target analyte. A key is generated that includes the order of the sets of probes, and the order of the probes is digitized in a code to identify each of the target analytes.

In some embodiments, each set of ordered probes is associated with a distinct tag for detecting the target analyte, and the number of distinct tags is less than the number N of target analytes. In that case, each of the N target analytes is matched with a sequence of M tags for the M cycles. The ordered sequence of tags is associated with the target analyte as an identifying code.
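A minimal sketch of such a key is given below, assuming three distinct tags, ten target analytes, and three cycles. The analyte names, tag names, and the use of lexicographic ordering to assign codes are hypothetical choices for illustration; any assignment of distinct tag sequences would serve the same purpose.

```python
from itertools import product

def build_key(targets, tags, n_cycles):
    """Assign each target a distinct ordered sequence of n_cycles tags."""
    if len(tags) ** n_cycles < len(targets):
        raise ValueError("not enough tag sequences for the number of targets")
    codes = product(tags, repeat=n_cycles)
    # zip stops at the shorter input, so surplus codes stay unassigned.
    return {target: code for target, code in zip(targets, codes)}

def decode(observed, key):
    """Return the target whose code matches the observed tag sequence, or None."""
    lookup = {code: target for target, code in key.items()}
    return lookup.get(tuple(observed))

targets = [f"analyte_{i}" for i in range(1, 11)]  # N = 10 hypothetical targets
tags = ["red", "green", "blue"]                   # 3 distinct tags, fewer than N
key = build_key(targets, tags, n_cycles=3)        # 3**3 = 27 codes available

# The tags observed at one position over the three cycles identify the analyte.
identified = decode(["red", "green", "red"], key)
```

Because only 10 of the 27 possible sequences are assigned, an observed sequence that matches no entry in the key can be flagged rather than miscalled.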

Quantification of Optically-Detected Probes

After the detection process, the signals from each probe pool are counted, and the presence or absence of a signal and the color of the signal can be recorded for each position on the substrate.

From the detectable signals, K bits of information are obtained in each of M cycles for the N distinct target analytes. The K bits of information are used to determine L total bits of information, such that K × M = L bits of information and L ≥ log₂(N). The L bits of information are used to determine the identity (and presence) of N distinct target analytes. If only one cycle (M = 1) is performed, then K × 1 = L. However, multiple cycles (M > 1) can be performed to generate more total bits of information L per analyte. Each subsequent cycle provides additional optical signal information that is used to identify the target analyte.
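The relationship between K, M, L, and N can be worked through numerically. The following sketch computes the smallest number of cycles M that satisfies the stated inequality for a given N and K; the function names are hypothetical, and no error-correction bits are included.

```python
def min_total_bits(n_targets):
    """Smallest integer L with 2**L >= n_targets, i.e. L >= log2(N)."""
    if n_targets < 1:
        raise ValueError("n_targets must be at least 1")
    return (n_targets - 1).bit_length()

def min_cycles(n_targets, bits_per_cycle):
    """Smallest M such that K * M >= the minimum total bits for N targets."""
    total = min_total_bits(n_targets)
    return -(-total // bits_per_cycle)  # ceiling division on integers

# Example: N = 1000 targets, K = 2 bits per cycle (e.g., four distinguishable states).
L = min_total_bits(1000)
M = min_cycles(1000, 2)
```

For N = 1000 and K = 2 this gives L = 10 bits and M = 5 cycles. Integer arithmetic is used instead of a floating-point logarithm so that exact powers of two are handled without rounding error.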

In practice, errors in the signals occur, and this confounds the accuracy of the identification of target analytes. For instance, probes may bind the wrong targets (e.g., false positives) or fail to bind the correct targets (e.g., false negatives). Methods are provided, as described below, to account for errors in optical and electrical signal detection.

Electrical Detection Methods

In other embodiments, electrical detection methods are used to detect the presence of target analytes on a substrate. Target analytes are tagged with oligonucleotide tail regions, and the oligonucleotide tags are detected using ion-sensitive field-effect transistors (ISFETs, or pH sensors), which measure hydrogen ion concentrations in solution. ISFETs are described in further detail in U.S. Pat. No. 7,948,015, filed on Dec. 14, 2007, to Rothberg et al., and U.S. Publication No. 2010/0301398, filed on May 29, 2009, to Rothberg et al., which are both incorporated by reference in their entireties.

ISFETs present a sensitive and specific electrical detection system for the identification and characterization of analytes. In one embodiment, the electrical detection methods disclosed herein are carried out by a computer (e.g., a processor). The ionic concentration of a solution can be converted to a logarithmic electrical potential by an electrode of an ISFET, and the electrical output signal can be detected and measured.

ISFETs have previously been used to facilitate DNA sequencing. During the enzymatic conversion of single-stranded (ss) DNA into double-stranded DNA, hydrogen ions are released as each nucleotide is added to the DNA molecule. An ISFET detects these released hydrogen ions and can determine when a nucleotide has been added to the DNA molecule. By synchronizing the incorporation of the nucleoside triphosphates (dATP, dCTP, dGTP, and dTTP), the DNA sequence may also be determined. For example, if no electrical output signal is detected when the single-stranded DNA template is exposed to dATP's, but an electrical output signal is detected in the presence of dGTP's, the DNA sequence is composed of a complementary cytosine base at the position in question.

In one embodiment, an ISFET is used to detect a tail region of a probe and then identify the corresponding target analyte. For example, a target analyte can be deposited on a substrate, such as an integrated-circuit chip that contains one or more ISFETs. When the corresponding probe (e.g., aptamer and tail region) is added and specifically binds to the target analyte, nucleotides and enzymes (polymerase) are added for transcription of the tail region. The ISFET detects the released hydrogen ions as electrical output signals and measures the change in ion concentration when the dNTP's are incorporated into the tail region. The amount of hydrogen ions released corresponds to the lengths and stops of the tail region, and this information about the tail regions can be used to differentiate among various tags.

The simplest type of tail region is one composed entirely of one homopolymeric base region. In this case, there are four possible tail regions: a poly-A tail, a poly-C tail, a poly-G tail, and a poly-T tail. However, it is often desirable to have a great diversity in tail regions.

One method of generating diversity in tail regions is by providing stop bases within a homopolymeric base region of a tail region. A stop base is a portion of a tail region comprising at least one nucleotide adjacent to a homopolymeric base region, such that the at least one nucleotide is composed of a base that is distinct from the bases within the homopolymeric base region. In one embodiment, the stop base is one nucleotide. In other embodiments, the stop base comprises a plurality of nucleotides. Generally, the stop base is flanked by two homopolymeric base regions. In an embodiment, the two homopolymeric base regions flanking a stop base are composed of the same base. In another embodiment, the two homopolymeric base regions are composed of two different bases. In another embodiment, the tail region contains more than one stop base.

In one example, an ISFET can detect a minimum threshold number of 100 hydrogen ions. Target Analyte 1 is bound to a composition with a tail region composed of a 100-nucleotide poly-A tail, followed by one cytosine base, followed by another 100-nucleotide poly-A tail, for a tail region length total of 201 nucleotides. Target Analyte 2 is bound to a composition with a tail region composed of a 200-nucleotide poly-A tail. Upon the addition of dTTP's and under conditions conducive to polynucleotide synthesis, synthesis on the tail region associated with Target Analyte 1 may release 100 hydrogen ions, which can be distinguished from polynucleotide synthesis on the tail region associated with Target Analyte 2, which may release 200 hydrogen ions. The ISFET may detect a different electrical output signal for each tail region. Furthermore, if dGTP's are added, followed by more dTTP's, the tail region associated with Target Analyte 1 may then release one, then 100 more hydrogen ions due to further polynucleotide synthesis. The distinct electrical output signals generated from the addition of specific nucleoside triphosphates based on tail region compositions allow the ISFET to detect hydrogen ions from each of the tail regions, and that information can be used to identify the tail regions and their corresponding target analytes.
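The example above can be reproduced with a short simulation of sequential nucleotide additions. The sketch assumes one hydrogen ion per incorporated nucleotide, complete extension at each addition, and a template written in the order in which the polymerase reads it; the function and variable names are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def run_flows(tail, flows):
    """Return the hydrogen ions released at each nucleotide addition for one tail region.

    tail  -- template bases in the order the polymerase reads them
    flows -- nucleotides supplied one at a time ("T" stands for dTTP, and so on)
    """
    position = 0
    released = []
    for nucleotide in flows:
        count = 0
        # Extend across every consecutive template base that pairs with this nucleotide.
        while position < len(tail) and COMPLEMENT[tail[position]] == nucleotide:
            position += 1
            count += 1
        released.append(count)
    return released

tail_1 = "A" * 100 + "C" + "A" * 100  # Target Analyte 1: poly-A, one stop base, poly-A
tail_2 = "A" * 200                    # Target Analyte 2: uninterrupted poly-A

flows = ["T", "G", "T"]               # dTTP, then dGTP, then dTTP
signal_1 = run_flows(tail_1, flows)   # [100, 1, 100]
signal_2 = run_flows(tail_2, flows)   # [200, 0, 0]
```

The two tail regions give different ion counts at the first addition of dTTP, and the stop base in the first tail region produces a distinct pattern over the following additions. Note that a single incorporation falls below the 100-ion detection threshold stated in the example, so in practice the first and third additions would carry the distinguishing signal.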

Various lengths of the homopolymeric base regions, stop bases, and combinations thereof can be used to uniquely tag each analyte in a sample. Additional description of the electrical detection of aptamers and tail regions to identify target analytes in a substrate is provided in U.S. Provisional Application No. 61/868,988, which is incorporated by reference in its entirety.

In other embodiments, antibodies are used as probes in the electrical detection method described above. The antibodies may be primary or secondary antibodies that bind via a linker region to an oligonucleotide tail region that acts as a tag.

These electrical detection methods can be used for the simultaneous detection of hundreds (or even thousands) of distinct target analytes. Each target analyte can be associated with a digital identifier, such that the number of distinct digital identifiers is proportional to the number of distinct target analytes in a sample. The identifier may be represented by a number of bits of digital information and is encoded within an ordered tail region set. Each tail region in an ordered tail region set is sequentially made to specifically bind a linker region of a probe region that is specifically bound to the target analyte. Alternatively, if the tail regions are covalently bonded to their corresponding probe regions, each tail region in an ordered tail region set is sequentially made to specifically bind a target analyte.

In one embodiment, one cycle is represented by a binding and stripping of a tail region to a linker region, such that polynucleotide synthesis occurs and releases hydrogen ions, which are detected as an electrical output signal. Thus, the number of cycles for the identification of a target analyte is equal to the number of tail regions in an ordered tail region set. The number of tail regions in an ordered tail region set is dependent on the number of target analytes to be identified, as well as the total number of bits of information to be generated. In another embodiment, one cycle is represented by a tail region covalently bonded to a probe region specifically binding and being stripped from the target analyte.

The electrical output signal detected from each cycle is digitized intobits of information, so that after all cycles have been performed tobind each tail region to its corresponding linker region, the total bitsof obtained digital information can be used to identify and characterizethe target analyte in question. The total number of bits is dependent ona number of identification bits for identification of the targetanalyte, plus a number of bits for error correction. The number of bitsfor error correction is selected based on the desired robustness andaccuracy of the electrical output signal. Generally, the number of errorcorrection bits may be 2 or 3 times the number of identification bits.

Decoding the Order and Identity of Detected Analytes

The probes used to detect the analytes are introduced to the substrate in an ordered manner in each cycle. A key is generated that encodes information about the order of the probes for each target analyte. The signals detected for each analyte can be digitized into bits of information. The order of the signals provides a code for identifying each analyte, which can be encoded in bits of information.
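The key-based decoding described above can be sketched as follows. This is a minimal illustration only: the two-bit color assignment in COLOR_BITS, the contents of KEY, the analyte names, and the function names are all hypothetical and are not taken from the disclosure.

```python
# Hypothetical 2-bit code per detected color (assumption, not from the disclosure).
COLOR_BITS = {"B": "00", "G": "01", "Y": "10", "R": "11"}


def signals_to_bits(colors):
    """Digitize an ordered sequence of detected colors, one per cycle, into bits."""
    return "".join(COLOR_BITS[c] for c in colors)


# Hypothetical key: expected bit code (ordered probe signals) for each target analyte.
KEY = {
    signals_to_bits("BGGBBYY"): "analyte_1",
    signals_to_bits("GGBYRRB"): "analyte_2",
}


def decode(observed_colors):
    """Return the analyte whose expected code matches the observed signals, else None."""
    return KEY.get(signals_to_bits(observed_colors))
```

Here an exact match is required; tolerance to detection errors is the subject of the error-correction methods described in the following section.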

Error-Correction Methods

In optical and electrical detection methods described above, errors can occur in binding and/or detection of signals. In some cases, the error rate can be as high as one in five (e.g., one out of five fluorescent signals is incorrect). This equates, on average, to one error in every five-cycle sequence. Actual error rates may not be as high as 20%, but error rates of a few percent are possible. In general, the error rate depends on many factors including the type of analytes in the sample and the type of probes used. In an electrical detection method, for example, a tail region may not properly bind to the corresponding probe region on an aptamer during a cycle. In an optical detection method, an antibody probe may fail to bind to its target or may bind to the wrong target.

Additional cycles are generated to account for errors in the detected signals and to obtain additional bits of information, such as parity bits. The additional bits of information are used to correct errors using an error correcting code. In one embodiment, the error correcting code is a Reed-Solomon code, which is a non-binary cyclic code used to detect and correct errors in a system. In other embodiments, various other error correcting codes can be used. Other error correcting codes include, for example, block codes, convolutional codes, Golay codes, Hamming codes, BCH codes, AN codes, Reed-Muller codes, Goppa codes, Hadamard codes, Walsh codes, Hagelbarger codes, polar codes, repetition codes, repeat-accumulate codes, erasure codes, online codes, group codes, expander codes, constant-weight codes, tornado codes, low-density parity check codes, maximum distance codes, burst error codes, Luby transform codes, fountain codes, and raptor codes. See Error Control Coding, 2nd Ed., S. Lin and DJ Costello, Prentice Hall, New York, 2004. Examples are also provided below that demonstrate the method for error-correction by adding cycles and obtaining additional bits of information.

One example of a Reed-Solomon code is an RS (15,9) code with 4-bit symbols, where n=15, k=9, s=4, and t=3, with n=2^s−1 and k=n−2t, “n” being the total number of symbols, “k” being the number of data symbols, “s” being the size of each symbol in bits, “t” being the number of symbol errors that can be corrected, and “2t” being the number of parity symbols. There are nine data symbols (k=9) and six parity symbols (2t=6). If base-X numbers are used, and X=4, then each fluorescent color is represented by two bits (each 0 or 1). A pair of colors may be represented by a four-bit symbol that includes two high bits and two low bits.

Since base-4 was chosen, seven probe pools, or a sequence of seven colors, are used to identify each target analyte. This sequence is represented by 3½ 4-bit symbols (14 bits). The remaining 5½ data symbols are set to zero. A Reed-Solomon RS (15,9) encoder then generates the six parity symbols (24 bits), represented by 12 additional probe pools. Thus, a total of 19 probe pools (7+12) are required to obtain error correction for t=3 symbols.
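The arithmetic in the RS (15,9) example can be checked with a few lines of code. The sketch below only reproduces the symbol and probe-pool counts given above; the function names are illustrative, and no Reed-Solomon encoding is performed.

```python
def rs_parameters(s, t):
    """Return (n, k, parity) in symbols for s-bit symbols correcting t symbol errors."""
    n = 2 ** s - 1          # total symbols per codeword
    parity = 2 * t          # parity symbols
    k = n - parity          # data symbols
    return n, k, parity


def probe_pools(id_pools, s, t, bits_per_color=2):
    """Return (identification pools, parity pools, total pools).

    Each probe pool (cycle) yields one color, i.e. bits_per_color bits in base-4.
    """
    _, _, parity = rs_parameters(s, t)
    parity_pools = parity * s // bits_per_color
    return id_pools, parity_pools, id_pools + parity_pools


print(rs_parameters(s=4, t=3))        # (15, 9, 6)
print(probe_pools(7, s=4, t=3))       # (7, 12, 19)
print(7 * 2 / 4)                      # 3.5 data symbols carry the seven colors
```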

Monte Carlo simulations of error-correcting code performance have been performed assuming seven probe pools, to identify up to 16,384 distinct targets. Using these simulations, the maximum permissible raw error rate (associated with identifying a fluorescent label) to achieve a corrected error rate of 10⁻⁵ was determined for different numbers of parity bits.

In some embodiments, a key is generated that includes the expected bits of information associated with an analyte (e.g., the expected order of probes and types of signals for the analyte). These expected bits of information for a particular analyte are compared with the actual L bits of information that are obtained from the target analyte. Using the Reed-Solomon approach, an allowance of up to t errors in the signals can be tolerated in the comparison of the expected bits of information and the actual L bits of information.

In some embodiments, a Reed-Solomon decoder is used to compare the expected signal sequence with an observed signal sequence from a particular probe. For example, seven probe pools may be used to identify a target analyte, the expected color sequence being BGGBBYY, represented by 14 bits. Additional parity pools may then be used for error correction. For example, six 4-bit parity symbols may be used.
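As an illustration of how a color sequence such as BGGBBYY might be laid out as data symbols for an RS (15,9) encoder, the following sketch packs two bits per color into 4-bit symbols and zero-fills the unused data bits. The color-to-value assignment in COLOR_VALUE and the function name are assumptions made for this example; computation of the parity symbols is not shown.

```python
# Hypothetical base-4 value for each color (assumption, not from the disclosure).
COLOR_VALUE = {"B": 0, "G": 1, "Y": 2, "R": 3}


def to_data_symbols(colors, k=9, s=4):
    """Pack 2-bit color values into s-bit symbols and zero-pad to k data symbols."""
    bits = "".join(format(COLOR_VALUE[c], "02b") for c in colors)
    bits = bits.ljust(k * s, "0")  # remaining data bits are set to zero
    return [int(bits[i:i + s], 2) for i in range(0, k * s, s)]


symbols = to_data_symbols("BGGBBYY")
print(symbols)  # [1, 4, 2, 8, 0, 0, 0, 0, 0]
```

The seven colors occupy the first 14 bits (3½ symbols); the fourth symbol is half data and half zero fill, and the remaining five symbols are zero.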

Methods and systems using cycled probe binding and optical detection are described in US Publication No. 2015/0330974, Digital Analysis of Molecular Analytes Using Single Molecule Detection, published Nov. 19, 2015, and US Publication No. 2018/0252936, High Speed Scanning With Acceleration Tracking, published Sep. 6, 2018, each of which is incorporated herein by reference in its entirety.

In some embodiments, the raw images are obtained using sampling that is at least at the Nyquist limit to facilitate more accurate determination of the oversampled image. Increasing the number of pixels used to represent the image by sampling in excess of the Nyquist limit (oversampling) increases the pixel data available for image processing and display.

Theoretically, a bandwidth-limited signal can be perfectly reconstructed if sampled at the Nyquist rate or above it. The Nyquist rate is defined as twice the highest frequency component in the signal. Oversampling improves resolution, reduces noise and helps avoid aliasing and phase distortion by relaxing anti-aliasing filter performance requirements. A signal is said to be oversampled by a factor of N if it is sampled at N times the Nyquist rate.

Thus, in some embodiments, each image is taken with a pixel size no more than half the wavelength of light being observed. In some embodiments, a pixel size of less than about 200 nm×200 nm is used in detection to achieve sampling at or above the Nyquist limit. Sampling at a frequency of at least the Nyquist limit during raw imaging of the substrate is preferred to optimize the resolution of the system or methods described herein. This can be done in conjunction with the deconvolution methods and optical systems described herein to resolve features on a substrate below the diffraction limit with high accuracy.

Processing Images from Different Cycles

The present invention overcomes several barriers to achieving sub-diffraction limited imaging.

Pixelation error is present in raw images and prevents identification of information present in the optical signals. Sampling at least at the Nyquist frequency and generation of an oversampled image as described herein each assist in overcoming pixelation error.

The point spread functions (PSFs) of various molecules overlap because the PSF size is greater than the pixel size (below Nyquist) and because the center-to-center spacing is so small that crosstalk due to spatial overlap occurs. Nearest neighbor variable regression (e.g., for center-to-center crosstalk correction) can be used to help with deconvolution of multiple overlapping optical signals. This can be improved if the relative location of each analyte on the substrate is known and the images of a field are well aligned. In some embodiments, machine learning (e.g., artificial intelligence or “A.I.”) can be used to help with deconvolution of multiple overlapping optical signals. In some embodiments, the machine learning processes input data over multiple cycles of probe binding and imaging to deconvolve further images.

After multiple cycles, the determined location of the molecule may become increasingly accurate. Using this information, additional calculations can be performed to aid in deconvolution by correcting for known asymmetries in the spatial overlap of optical signals occurring due to pixel discretization effects and the diffraction limit. They can also be used to correct for overlap between the emission spectra of different labels.

Highly accurate relative positional information for each analyte can be achieved by overlaying images of the same field from different cycles to generate a distribution of measured peaks from optical signals of different probes bound to each analyte. This distribution can then be used to generate a peak signal that corresponds to a single relative location of the analyte. Images from a subset of cycles can be used to generate relative location information for each analyte. In some embodiments, this relative position information is provided in a localization file.
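The overlay of per-cycle peaks into a single location can be sketched as below. This is a minimal illustration that assumes the peaks for one analyte have already been aligned to a common reference frame and uses the centroid of the peaks as the single location; the centroid estimator, the function name, and the example coordinates are assumptions rather than the method of the disclosure.

```python
from math import sqrt
from statistics import mean, stdev


def localize(peaks):
    """Combine per-cycle peak positions for one analyte into one location.

    peaks: list of (x, y) positions in nm, one per cycle (at least two).
    Returns the centroid and the standard error of the mean along each axis.
    """
    xs, ys = zip(*peaks)
    n = len(peaks)
    center = (mean(xs), mean(ys))
    sem = (stdev(xs) / sqrt(n), stdev(ys) / sqrt(n))
    return center, sem


# Four hypothetical cycles scattered a few nm around the same analyte.
center, sem = localize([(100, 200), (104, 196), (96, 204), (100, 200)])
print(center)  # (100, 200)
```

Because the standard error falls with the square root of the number of cycles, adding cycles tightens the position estimate even though each individual image remains diffraction limited.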

The specific area imaged for a field for each cycle may vary from cycle to cycle. Thus, to improve the accuracy of identification of analyte position for each image, an alignment between images of a field across multiple cycles can be performed. From this alignment, offset information compared to a reference file can then be identified and incorporated into the deconvolution algorithms to further increase the accuracy of deconvolution and signal identification for optical signals obscured due to the diffraction limit. In some embodiments, this information is provided in a Field Alignment File.

Signal Detection (Cross-Talk Nearest Neighbor)

Once relative positional information is accurately determined for analytes on a substrate and field images from each cycle are aligned with this positional information, analysis of each oversampled image using crosstalk and nearest neighbor regression can be used to accurately identify an optical signal from each analyte in each image.

In some embodiments, a plurality of optical signals obscured by the diffraction limit of the optical system are identified for each of a plurality of biomolecules deposited on a substrate and bound to probes comprising a detectable label. In some embodiments, the probes are incorporated nucleotides and the series of cycles is used to determine a sequence of a polynucleotide deposited on the array using sequencing by synthesis.

Simulations of Deconvolution Applied to Images

Molecular densities are limited by crosstalk from neighboring molecules. FIG. 3 depicts simulated images of single analytes. This particular image is a simulation of a layer of analytes on a 600 nm pitch that has been processed with a 2× oversampled filter. Crosstalk into eight adjacent spots is averaged as a function of array pitch and algorithm type.

FIG. 4 is a series of images processed with multiple pitches and two variations of image processing algorithms: the first is a 2× oversampled image and the second is a 4× oversampled image with deconvolution, as described herein. FIG. 5 is the crosstalk analysis of these two types of image processing at pitches down to 200 nm. Acceptable crosstalk levels at or below 25% with 2× oversampling occur for pitches at or above 275 nm. Acceptable crosstalk levels at or below 25% with 4× deconvolution using the point spread function of the optical system occur for pitches at or above 210 nm.

The physical size of the molecule may broaden the spot by roughly half the size of the binding area. For example, for an 80 nm spot the pitch may be increased by roughly 40 nm. Smaller spot sizes may be used, but this may have the trade-off that fewer copies may be allowed and greater illumination intensity may be required. A single copy provides the simplest sample preparation but requires the greatest illumination intensity.

Methods for sub-diffraction limit imaging discussed to this point involve image processing techniques of oversampling, deconvolution and crosstalk correction. Described herein are methods and systems that incorporate determination of the precise relative location of analytes on the substrate using information from multiple cycles of probe optical signal imaging for the analytes. Using this information, additional calculations can be performed to aid in crosstalk correction regarding known asymmetries in the crosstalk matrix occurring due to pixel discretization effects.

Methods

In some embodiments, as shown in FIG. 6, provided herein is a method for accurately determining a relative position of analytes deposited on the surface of a densely packed substrate. The method includes first providing a substrate comprising a surface, wherein the surface comprises a plurality of analytes deposited on the surface at discrete locations. Then, a plurality of cycles of probe binding and signal detection on said surface is performed. Each cycle of detection includes contacting the analytes with a probe set capable of binding to target analytes deposited on the surface, imaging a field of said surface with an optical system to detect a plurality of optical signals from individual probes bound to said analytes at discrete locations on said surface, and removing bound probes if another cycle of detection is to be performed. From each image, a peak location from each of said plurality of optical signals from images of said field from at least two (i.e., a subset) of said plurality of cycles is detected. The location of peaks for each analyte is overlaid, generating a cluster of peaks from which an accurate relative location of each analyte on the substrate is then determined.

In some embodiments, as shown in FIG. 7, the accurate position information for analytes on the substrate is then used in a deconvolution algorithm incorporating position information (e.g., for identifying center-to-center spacing between neighboring analytes on the substrate), which can be applied to each image to deconvolve overlapping optical signals from each of said images. In some embodiments, the deconvolution algorithm includes nearest neighbor variable regression for spatial discrimination between neighboring analytes with overlapping optical signals.

In some embodiments, as shown in FIG. 8, the method of analyte detection is applied for sequencing of individual polynucleotides deposited on a substrate.

In some embodiments, optical signals are deconvolved from densely packed substrates as shown in FIG. 11. The operations can be divided into four different sections as shown in FIG. 9: 1) Image Analysis, which includes generation of oversampled images from each image of a field for each cycle, and generation of a peak file (i.e., a data set) including peak location and intensity for each detected optical signal in an image. 2) Generation of a Localization File, which includes alignment of multiple peaks generated from the multiple cycles of optical signal detection for each analyte to determine an accurate relative location of the analyte on the substrate. 3) Generation of a Field Alignment file, which includes offset information for each image to align images of the field from different cycles of detection with respect to a selected reference image. 4) Extract Intensities, which uses the offset information and location information in conjunction with deconvolution modeling to determine an accurate identity of signals detected from each oversampled image. The “Extract Intensities” operation can also include other error correction, such as previous cycle regression used to correct for errors in sequencing by synthesis processing and detection. The operations performed in each section are described in further detail below.

Under the image analysis operations shown in FIG. 10A and FIG. 11, the images of each field from each cycle are processed to increase the number of pixels for each detected signal, sharpen the peaks for each signal, and identify peak intensities from each signal. This information is used to generate a peak file for each field for each cycle that includes a measure of the position of each analyte (from the peak of the observed optical signal), and the intensity, from the peak intensity from each signal. In some embodiments, the image from each field first undergoes background subtraction to perform an initial removal of noise from the image. Then, the images are processed using smoothing and deconvolution to generate an oversampled image, which includes artificially generated pixels based on modeling of the signal observed in each image. In some embodiments, the oversampled image can generate 4 pixels, 9 pixels, or 16 pixels from each pixel from the raw image.
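A simplified version of peak-file generation is sketched below, assuming NumPy is available. It performs a crude background subtraction (subtracting the image median), finds local maxima above a threshold, and records a sub-pixel position and an intensity for each peak. It does not construct the oversampled image itself; instead, an intensity-weighted centroid over a 3×3 neighborhood stands in for the modeled sub-pixel information. The function name, the median background model, and the centroid step are assumptions made for illustration.

```python
import numpy as np


def build_peak_file(raw, threshold):
    """Return a list of peaks, each with sub-pixel x, y (raw pixel units) and intensity."""
    img = raw.astype(float) - np.median(raw)   # crude background subtraction
    dy, dx = np.mgrid[-1:2, -1:2]              # row and column offsets of a 3x3 patch
    peaks = []
    for r in range(1, img.shape[0] - 1):
        for c in range(1, img.shape[1] - 1):
            patch = img[r - 1:r + 2, c - 1:c + 2]
            # Keep only pixels above threshold that are the maximum of their patch.
            if img[r, c] < threshold or img[r, c] < patch.max():
                continue
            w = np.clip(patch, 0.0, None)      # weights for the centroid
            x = c + float((w * dx).sum() / w.sum())
            y = r + float((w * dy).sum() / w.sum())
            peaks.append({"x": x, "y": y, "intensity": float(img[r, c])})
    return peaks


raw = np.zeros((5, 5))
raw[2, 2] = 10.0   # bright pixel
raw[2, 3] = 5.0    # signal spilling into the pixel to its right
print(build_peak_file(raw, threshold=3.0))
```

In this toy image the single reported peak lies one third of a pixel to the right of the brightest pixel, which is the kind of sub-pixel position information a peak file carries forward to localization.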

Peaks from optical signals detected in each raw image or present in the oversampled image are then identified and intensity and position information for each detected analyte is placed into a peak file for further processing.

In some embodiments, N raw images corresponding to all images detected from each cycle and each field of a substrate are output into N oversampled images and N peak files for each imaged field. The peak file comprises a relative position of each detected analyte for each image. In some embodiments, the peak file also comprises intensity information for each detected analyte. In some embodiments, one peak file is generated for each color and each field in each cycle. In some embodiments, each cycle further comprises multiple passes, such that one peak file can be generated for each color and each field for each pass in each cycle. In some embodiments, the peak file specifies peak locations from optical signals within a single field.

In preferred embodiments, the peak file includes XY position information from each processed oversampled image of a field for each cycle. The XY position information comprises estimated coordinates of the locations of each detected detectable label from a probe (such as a fluorophore) from the oversampled image. The peak file can also include intensity information from the signal from each individual detectable label.

Generation of an oversampled image is used to overcome pixelation error and to identify information that cannot otherwise be extracted due to pixelation. Initial processing of the raw image by smoothing and deconvolution helps to provide more accurate information in the peak files so that the position of each analyte can be determined with higher accuracy, and this information subsequently can be used to provide a more accurate determination of signals obscured in diffraction limited imaging.

Smoothing uses an approximating function to capture important patterns in the data, while leaving out noise or other fine-scale structures/rapid phenomena. In smoothing, the data points of a signal are modified so that individual points that are higher than the adjacent points are reduced, and points that are lower than the adjacent points are increased, leading to a smoother signal. Smoothing is used herein to smooth the diffraction limited optical signal detected in each image to better identify peaks and intensities from the signal.
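A centered moving average is one simple smoothing function with the behavior described above. The sketch below is illustrative only; the disclosure does not specify which smoothing function is used, and the function name and window size are assumptions.

```python
def moving_average(signal, window=3):
    """Smooth a 1-D signal with a centered moving average (window should be odd).

    Near the ends, the average is taken over the samples that exist.
    """
    half = window // 2
    out = []
    for i in range(len(signal)):
        lo, hi = max(0, i - half), min(len(signal), i + half + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out


print(moving_average([0, 9, 0, 0]))  # [4.5, 3.0, 3.0, 0.0]
```

The isolated high point (9) is pulled down and its lower neighbors are raised, which is the effect described in the paragraph above.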

Although each raw image is diffraction limited, described herein are methods that result in collection of multiple signals from the same analyte from different cycles. An embodiment of this method is shown in the flowchart in FIG. 10B. These multiple signals from each analyte are used to determine a position much more accurate than the diffraction limited signal from each individual image. They can be used to identify molecules within a field at a resolution of less than 5 nm. This information is then stored as a localization file, as shown in FIG. 11. The highly accurate position information can then be used to greatly improve signal identification from each individual field image in combination with deconvolution algorithms, such as cross-talk regression and nearest neighbor variable regression.

As shown in FIG. 11, the operations for generating a localization file use the location information provided in the peak files to determine relative positions of a set of analytes on the substrate. In some embodiments, each localization file contains relative positions from sets of analytes from a single imaged field of the substrate. The localization file combines position information from multiple cycles to generate highly accurate position information for detected analytes below the diffraction limit.

In some embodiments, the relative position information for each analyte is determined on average to less than a 10 nm standard deviation (i.e., RMS, or root mean square). In some embodiments, the relative position information for each analyte is determined on average to less than a 10 nm 2× standard deviation. In some embodiments, the relative position information for each analyte is determined on average to less than a 10 nm 3× standard deviation. In some embodiments, the relative position information for each analyte is determined to less than a 10 nm median standard deviation. In some embodiments, the relative position information for each analyte is determined to less than a 10 nm median 2× standard deviation. In some embodiments, the relative position information for each analyte is determined to less than a 10 nm median 3× standard deviation.

From a subset of peak files for a field from different cycles, a localization file is generated to determine a location of analytes on the array. As shown in FIG. 11, in some embodiments, a peak file is first normalized using a point spread function to account for aberrations in the optical system. The normalized peak file can be used to generate an artificial normalized image based on the location and intensity information provided in the peak file. Each image is then aligned. In some embodiments, the alignment can be performed by correlating each image pair and performing a fine fit. Once aligned, position information for each analyte from each cycle can then be overlaid to provide a distribution of position measurements on the substrate. This distribution is used to determine a single peak position that provides a highly accurate relative position of the analyte on the substrate. In some embodiments, a Poisson distribution is applied to the overlaid positions for each analyte to determine a single peak.

The peaks determined from at least a subset of position information from the cycles are then recorded in a localization file, which comprises a measure of the relative position of each detected analyte with an accuracy below the diffraction limit. As described, images from only a subset of cycles are needed to determine this information.

As shown in FIG. 11, a normalized peak file from each field for each cycle and color and the normalized localization file can be used to generate offset information for each image from a field relative to a reference image of the field. This offset information can be used to improve the accuracy of the relative position determination of the analyte in each raw image for further improvements in signal identification from a densely packed substrate and a diffraction limited image. In some embodiments, this offset information is stored as a field alignment file. In some embodiments, the position information of each analyte in a field from the combined localization file and field alignment file is less than 10 nm RMS, less than 5 nm RMS, or less than 2 nm RMS.

In some embodiments, a field alignment file is generated by alignment of images from a single field by determining offset information relative to a master file from the field. One field alignment file is generated for each field. This file is generated from all images of the field from all cycles, and includes offset information for all images of the field relative to a reference image from the field.

In some embodiments, before alignment, each peak file is normalized with a point spread function, followed by generation of an artificial image from the normalized peak file and Fourier transform of the artificial image. The Fourier transform of the artificial image of the normalized peak file is then convolved with a complex conjugate of the Fourier transform of an artificial image from the normalized localization file for the corresponding field. This is done for each peak file for each cycle. The resulting files then undergo an inverse Fourier transform to regenerate image files, and the image files are aligned relative to the reference file from the field to generate offset information for each image file. In some embodiments, this alignment includes a fine fit relative to a reference file.
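One common way to obtain an image offset from Fourier transforms is circular cross-correlation, sketched below with NumPy. The sketch assumes that the combination of the two transforms is an element-wise product of one transform with the complex conjugate of the other, which corresponds to cross-correlation in image space; this interpretation, the function name, and the restriction to whole-pixel shifts (no fine fit) are assumptions for illustration.

```python
import numpy as np


def fft_offset(image, reference):
    """Estimate the integer (dy, dx) shift of `image` relative to `reference`."""
    # Product of one transform with the conjugate of the other, then inverse transform.
    corr = np.fft.ifft2(np.fft.fft2(image) * np.conj(np.fft.fft2(reference))).real
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    # Indices past the midpoint represent negative (wrapped) shifts.
    h, w = corr.shape
    if dy > h // 2:
        dy -= h
    if dx > w // 2:
        dx -= w
    return int(dy), int(dx)


# Artificial reference image with two point sources, and a shifted copy of it.
ref = np.zeros((16, 16))
ref[5, 5] = 1.0
ref[8, 10] = 0.5
img = np.roll(ref, shift=(2, -3), axis=(0, 1))
print(fft_offset(img, ref))  # (2, -3)
```

The location of the correlation maximum gives the offset that would be recorded for the image; a sub-pixel fine fit around that maximum would follow in a fuller implementation.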

The field alignment file thus contains offset information for each oversampled image, and can be used in conjunction with the localization file for the corresponding field to generate highly accurate relative position for each analyte for use in the subsequent “Extract Intensities” operations.

As an example, where 20 cycles are performed on a field and one image is generated for each of 4 colors to be detected, thus generating 80 images of the field, one Field Alignment file is generated for all 80 images (20 cycles × 4 colors) taken of the field. In some embodiments, the field alignment file contents include: the field, the color observed for each image, the operation type in the cycled detection (e.g., binding or stripping), and the image offset coordinates relative to the reference image.

In some embodiments, during the alignment process, the XY “shifts” or “residuals” needed to align two images are calculated, the process is repeated for the remaining images, and a best fit residual to apply to all images is calculated.

In some embodiments, residuals that exceed a threshold are thrown out, and the best fit is re-calculated. This process is repeated until all individual residuals are within the threshold.
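The iterative rejection of outlying residuals can be sketched as follows. The sketch assumes that the best fit is the mean of the retained residuals and that a residual is rejected when it deviates from the current fit by more than the threshold on either axis; both choices, and the function name, are assumptions, since the fitting method is not specified above.

```python
from statistics import mean


def best_fit_offset(residuals, threshold):
    """Fit a common (dx, dy) offset, discarding outlying residuals until stable.

    residuals: list of (dx, dy) shifts, one per aligned image pair.
    Returns the fitted offset and the residuals that were kept.
    """
    kept = list(residuals)
    while True:
        fit = (mean(r[0] for r in kept), mean(r[1] for r in kept))
        inliers = [r for r in kept
                   if abs(r[0] - fit[0]) <= threshold
                   and abs(r[1] - fit[1]) <= threshold]
        # Stop when nothing more is rejected (or rejecting would leave nothing).
        if len(inliers) == len(kept) or not inliers:
            return fit, kept
        kept = inliers


fit, kept = best_fit_offset(
    [(1.0, 1.0), (1.2, 0.8), (0.8, 1.2), (9.0, 9.0)], threshold=3.0)
print(fit, len(kept))
```

In this example the residual (9.0, 9.0) is discarded on the first pass, and the fit is then recomputed from the three consistent residuals.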

Each oversampled image is then deconvolved using the accurate position information from the localization file and the offset information from the field alignment file. An embodiment of the intensity extraction operation is shown in FIG. 10C and FIG. 11. Because the center-to-center spacing is so small, the point spread functions (PSFs) of signals from adjacent analytes overlap. Nearest neighbor variable regression in combination with the accurate analyte position information and/or offset information can be used to deconvolve signals from adjacent analytes that have a center-to-center distance that inhibits resolution due to the diffraction limit. The use of the accurate relative position information for each analyte facilitates spatial deconvolution of optical signals from neighboring analytes below the diffraction limit. In some embodiments, the relative position of neighboring analytes is used to determine an accurate center-to-center distance between neighboring analytes, which can be used in combination with the point spread function of the optical system to estimate spatial cross-talk between neighboring analytes for use in deconvolution of the signal from each individual image. This enables the use of substrates with analytes packed at a center-to-center spacing below the diffraction limit for optical detection techniques, such as polynucleotide sequencing.
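The use of known center-to-center distances and a point spread function to remove spatial cross-talk can be illustrated with a small linear model. The sketch below assumes a symmetric Gaussian PSF of width sigma_nm and models the intensity observed at each analyte position as a weighted sum of the true intensities of all analytes; the Gaussian form, the width used, and the function names are assumptions and do not represent the specific regression of the disclosure.

```python
import numpy as np


def psf_overlap_matrix(positions_nm, sigma_nm):
    """A[i, j]: fraction of analyte j's peak signal seen at analyte i's position."""
    p = np.asarray(positions_nm, dtype=float)
    d2 = ((p[:, None, :] - p[None, :, :]) ** 2).sum(axis=2)  # squared distances
    return np.exp(-d2 / (2.0 * sigma_nm ** 2))


def nearest_neighbor_regression(observed, positions_nm, sigma_nm):
    """Least-squares estimate of true intensities from overlapping observations."""
    A = psf_overlap_matrix(positions_nm, sigma_nm)
    solution, *_ = np.linalg.lstsq(A, np.asarray(observed, dtype=float), rcond=None)
    return solution


# Three analytes 250 nm apart; the middle entry is dark but picks up cross-talk.
positions = [(0.0, 0.0), (250.0, 0.0), (0.0, 250.0)]
true = np.array([1.0, 0.0, 0.5])
observed = psf_overlap_matrix(positions, 120.0) @ true
print(observed)                                           # dark analyte appears lit
print(nearest_neighbor_regression(observed, positions, 120.0))
```

The accuracy of the recovered intensities depends directly on how well the positions, and therefore the overlap matrix, are known, which is why the localization and field alignment files feed into this step.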

In certain embodiments, emission spectra overlap between different signals (i.e., “cross-talk”). For example, during sequencing by synthesis, the four dyes used in the sequencing process typically have some overlap in emission spectra.

In particular embodiments, the problem of assigning a color (for example, a base call) to different features in a set of images obtained for a cycle, when crosstalk occurs between different color channels and when the crosstalk is different for different sets of images, can be solved by cross-talk regression in combination with the localization and field alignment files for each oversampled image, which removes overlapping emission spectra from the optical signals of each different detectable label used. This further increases the accuracy of identification of the detectable label identity for each probe bound to each analyte on the substrate.
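Removal of emission-spectrum overlap between color channels can be illustrated by linear unmixing. In the sketch below, the 4×4 mixing matrix M is entirely hypothetical (its values are invented for illustration), and solving a linear system stands in for the cross-talk regression; in practice the matrix would be estimated from the data for each set of images.

```python
import numpy as np

# Hypothetical mixing matrix: column j is the response of the four detection
# channels to dye j. The values are invented for illustration only.
M = np.array([
    [1.00, 0.30, 0.00, 0.00],
    [0.10, 1.00, 0.05, 0.00],
    [0.00, 0.05, 1.00, 0.25],
    [0.00, 0.00, 0.10, 1.00],
])


def unmix(channel_intensities, mixing=M):
    """Return the per-dye intensities and the index of the brightest dye."""
    dye = np.linalg.solve(mixing, np.asarray(channel_intensities, dtype=float))
    return dye, int(np.argmax(dye))


# Dye 1 alone produces signal in three channels before correction.
observed = M @ np.array([0.0, 1.0, 0.0, 0.0])
dye, call = unmix(observed)
print(observed, dye.round(6), call)
```

After unmixing, the signal is attributed to a single dye, which corresponds to a single color assignment (for example, a base call) for that analyte in that cycle.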

Thus, in some embodiments, identification of a signal and/or its intensity from a single image of a field from a cycle as disclosed herein uses the following features: 1) Oversampled Image: provides intensities and signals at defined locations. 2) Accurate Relative Location: Localization File (provides location information from at least a subset of cycles) and Field Alignment File (provides offset/alignment information for all images in a field). 3) Image Processing: Nearest Neighbor Variable Regression (spatial deconvolution) and Cross-talk regression (emission spectra deconvolution) using accurate relative position information for each analyte in a field. Together, these features provide accurate identification of probes (e.g., antibodies for detection or complementary nucleotides for sequencing) for each analyte.

Image Processing Simulations

The effects of the methods and systems disclosed herein are illustrated in simulated cross-talk plots shown in FIG. 12A, FIG. 12B, FIG. 13A and FIG. 13B. For each of these figures, a cross-talk plot showing the intensity of the emission spectrum correlated with one of four fluorophores at each detected analyte in a 10 um×10 um region is shown. Each of four axes, one per fluorophore, extends to a corner of the plot. Thus, a spot located in the center of the plot may have equal contribution of intensity from all four fluorophores. Emission intensity detected from an individual fluorophore during an imaging cycle is assigned to move the spot in a direction towards X, Y; X, −Y; −X, Y; or −X, −Y. Thus, separation of populations of spots along these four axes indicates a clear deconvolved signal from a fluorophore at an analyte location. Each simulation is based on detection of 1024 molecules in a 10.075 um×10.075 um region, indicating a density of 10.088 molecules per micron squared, or an average center-to-center distance between molecules of about 315 nm. This is correlated with an imaging region of about 62×62 pixels at a pixel size of less than about 200 nm×200 nm.
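The density, pitch, and pixel-size figures quoted for the simulation follow from simple arithmetic, reproduced below. The function name is illustrative, and the pitch is computed as the side length of the square area available to each molecule on average.

```python
from math import sqrt


def density_and_pitch(n_molecules, side_um):
    """Return (molecules per square micron, average center-to-center distance in nm)."""
    density = n_molecules / side_um ** 2
    pitch_nm = 1000.0 / sqrt(density)
    return density, pitch_nm


density, pitch_nm = density_and_pitch(1024, 10.075)
pixel_nm = 10.075 * 1000.0 / 62   # field width divided by 62 pixels

print(round(density, 3))    # 10.088
print(round(pitch_nm, 1))   # 314.8
print(pixel_nm)             # 162.5
```

The resulting pixel size of 162.5 nm is consistent with the stated pixel size of less than about 200 nm×200 nm.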

In some embodiments, the average center-to-center distance between molecules is about 150 nm to about 500 nm. In some embodiments, the average center-to-center distance between molecules is about 150 nm to about 175 nm, about 150 nm to about 200 nm, about 150 nm to about 225 nm, about 150 nm to about 250 nm, about 150 nm to about 275 nm, about 150 nm to about 300 nm, about 150 nm to about 325 nm, about 150 nm to about 350 nm, about 150 nm to about 375 nm, about 150 nm to about 400 nm, about 150 nm to about 500 nm, about 175 nm to about 200 nm, about 175 nm to about 225 nm, about 175 nm to about 250 nm, about 175 nm to about 275 nm, about 175 nm to about 300 nm, about 175 nm to about 325 nm, about 175 nm to about 350 nm, about 175 nm to about 375 nm, about 175 nm to about 400 nm, about 175 nm to about 500 nm, about 200 nm to about 225 nm, about 200 nm to about 250 nm, about 200 nm to about 275 nm, about 200 nm to about 300 nm, about 200 nm to about 325 nm, about 200 nm to about 350 nm, about 200 nm to about 375 nm, about 200 nm to about 400 nm, about 200 nm to about 500 nm, about 225 nm to about 250 nm, about 225 nm to about 275 nm, about 225 nm to about 300 nm, about 225 nm to about 325 nm, about 225 nm to about 350 nm, about 225 nm to about 375 nm, about 225 nm to about 400 nm, about 225 nm to about 500 nm, about 250 nm to about 275 nm, about 250 nm to about 300 nm, about 250 nm to about 325 nm, about 250 nm to about 350 nm, about 250 nm to about 375 nm, about 250 nm to about 400 nm, about 250 nm to about 500 nm, about 275 nm to about 300 nm, about 275 nm to about 325 nm, about 275 nm to about 350 nm, about 275 nm to about 375 nm, about 275 nm to about 400 nm, about 275 nm to about 500 nm, about 300 nm to about 325 nm, about 300 nm to about 350 nm, about 300 nm to about 375 nm, about 300 nm to about 400 nm, about 300 nm to about 500 nm, about 325 nm to about 350 nm, about 325 nm to about 375 nm, about 325 nm to about 400 nm, about 325 nm to about 500 nm, about 350 nm to about 375 nm, about 350 nm to about 400 nm, about 350 nm to about 500 nm, about 375 nm to about 400 nm, about 375 nm to about 500 nm, or about 400 nm to about 500 nm. In some embodiments, the average center-to-center distance between molecules is about 150 nm, about 175 nm, about 200 nm, about 225 nm, about 250 nm, about 275 nm, about 300 nm, about 325 nm, about 350 nm, about 375 nm, about 400 nm, or about 500 nm. In some embodiments, the average center-to-center distance between molecules is at least about 150 nm, about 175 nm, about 200 nm, about 225 nm, about 250 nm, about 275 nm, about 300 nm, about 325 nm, about 350 nm, about 375 nm, or about 400 nm. In some embodiments, the average center-to-center distance between molecules is at most about 175 nm, about 200 nm, about 225 nm, about 250 nm, about 275 nm, about 300 nm, about 325 nm, about 350 nm, about 375 nm, about 400 nm, or about 500 nm.

FIG. 12A shows the cross-talk plot of fluorophore intensity between the four fluorophores from optical signals detected from the raw image. FIG. 12B and FIG. 13A each show the separation between the four fluorophores achieved by generating a 4× oversampled image, indicating the achievement of some removal of cross-talk at each analyte. FIG. 13B shows a cross-talk plot for the same imaging region but with deconvolution and nearest neighbor regression performed as shown in FIG. 11 and described herein. As compared with FIG. 13A and FIG. 12A, each analyte detected shows clear separation of its optical signal from the other fluorophores, indicating a highly accurate fluorophore identification for each analyte.
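One way to place a spot on a cross-talk plot of the kind described above can be sketched as follows. This is an illustration only: the assignment of a given fluorophore to a given corner, the channel names `dye_1` to `dye_4`, and the normalization by total intensity are assumptions and are not taken from the disclosure.

```python
# Hypothetical channel-to-corner assignment; the text states only that each
# of the four fluorophores pulls a spot toward one corner of the plot.
CORNERS = {
    "dye_1": (1.0, 1.0),    # toward  X,  Y
    "dye_2": (1.0, -1.0),   # toward  X, -Y
    "dye_3": (-1.0, 1.0),   # toward -X,  Y
    "dye_4": (-1.0, -1.0),  # toward -X, -Y
}


def crosstalk_xy(intensities):
    """Map four channel intensities at one analyte to an (x, y) plot point.

    Assumption: each channel contributes in proportion to its share of the
    total intensity, so equal intensities land at the center of the plot.
    """
    total = sum(intensities.values())
    if total <= 0:
        return (0.0, 0.0)
    x = sum(CORNERS[dye][0] * value for dye, value in intensities.items()) / total
    y = sum(CORNERS[dye][1] * value for dye, value in intensities.items()) / total
    return (x, y)


# A clean signal sits at a corner; cross-talk drags the spot toward the center.
clean = crosstalk_xy({"dye_1": 5.0, "dye_2": 0.0, "dye_3": 0.0, "dye_4": 0.0})
mixed = crosstalk_xy({"dye_1": 3.0, "dye_2": 1.0, "dye_3": 0.0, "dye_4": 0.0})
print(clean, mixed)
```

With this mapping, four well-separated populations near the corners correspond to the clear separation described for FIG. 13B, while spots drawn toward the center or along an edge indicate residual cross-talk.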

FIG. 14A and FIG. 14B show a simulated four-color composite of each detected 10.075 um×10.075 um region as simulated above. This visually represents the clarity between analytes from the raw image (FIG. 14A) and the image processed as described herein (FIG. 14B).

Sequencing

The methods described above and in FIG. 11 also facilitate sequencing by synthesis using optical detection of complementary reversible terminators incorporated into a growing complementary strand on a substrate comprising densely packed polynucleotides. Thus, signals correlating with the sequence of neighboring polynucleotides at a center-to-center distance below the diffraction limit can be reliably detected using the methods and optical detection systems described herein. Image processing during sequencing can also include previous cycle regression based on clonal sequences repeated on the substrate or on the basis of the data itself to correct for errors in the sequencing reaction or detection. In some embodiments, the polynucleotides deposited on the substrate for sequencing are concatemers. A concatemer can comprise multiple identical copies of a polynucleotide to be sequenced. Thus, each optical signal identified by the methods and systems described herein can refer to a single detectable label (e.g., a fluorophore) from an incorporated nucleotide, or can refer to multiple detectable labels bound to multiple locations on a single concatemer, such that the signal is an average from multiple locations. The resolution that may occur may not be between individual detectable labels, but between different concatemers deposited to the substrate.

In some embodiments, molecules to be sequenced, in single or multiple copies, may be bound to the surface using covalent linkages, by hybridizing to a capture oligonucleotide on the surface, or by other non-covalent binding. The bound molecules may remain on the surface for hundreds of cycles and can be re-interrogated with different primer sets, following stripping of the initial sequencing primers, to confirm the presence of specific variants.

In one embodiment, the fluorophores and blocking groups may be removed using chemical reactions.

In another embodiment, the fluorescent and blocking groups may be removed using UV light.

In one embodiment, the molecules to be sequenced may be deposited on reactive surfaces that have 50-100 nm diameters, and these areas may be spaced at a pitch of 150-300 nm. These molecules may have barcodes attached to them for target de-convolution and a sequencing primer binding region for initiating sequencing. Buffers may contain appropriate amounts of DNA polymerase to enable an extension reaction. The reactive surfaces may contain 10-100 copies of the target to be sequenced, generated by any of the gene amplification methods available (PCR, whole genome amplification, etc.).

In another embodiment, single target molecules, tagged with a barcode and a primer annealing site, may be deposited on a 20-50 nm diameter reactive surface spaced with a pitch of 60-150 nm. The molecules may be sequenced individually.

In one embodiment, a primer may bind to the target and may be extended using one dNTP at a time with a single fluorophore or multiple fluorophores; the surface may be imaged, the fluorophore may be removed and washed, and the process repeated to generate a second extension. The presence of multiple fluorophores on the same dNTP may enable defining the number of repeated nucleotides present in some regions of the genome (2 to 5 or more).

In a different embodiment, following primer annealing, all four dNTPs with fluorophores and blocked 3′ hydroxyl groups may be used in the polymerase extension reaction, the surface may be imaged, the fluorophore and blocking groups removed, and the process repeated for multiple cycles.

In another embodiment, the sequences may be inferred based on ligation reactions that anneal specific probes that ligate based on the presence of a specific nucleotide at a given position.

A random array may be used, which may have improved densities over prior art random arrays using the techniques outlined above; however, random arrays generally have areal densities 4× to 10× lower than those of ordered arrays. Advantages of a random array include a uniform, non-patterned surface for the chip and the use of shorter nucleic acid strands because there is no need to rely on the exclusionary properties of longer strands.

Computer Systems

The present disclosure provides computer systems that are programmed to implement methods of the disclosure. FIG. 28 shows a computer system 2801 that is programmed or otherwise configured to direct the methods described herein and utilize the systems described herein. The computer system 2801 can regulate various aspects of the present disclosure, such as, for example, directing the cycles of probe binding described herein. The computer system 2801 can be an electronic device of a user or a computer system that is remotely located with respect to the electronic device. The electronic device can be a mobile electronic device.

The computer system 2801 includes a central processing unit (CPU, also “processor” and “computer processor” herein) 2805, which can be a single core or multi core processor, or a plurality of processors for parallel processing. The computer system 2801 also includes memory or memory location 2810 (e.g., random-access memory, read-only memory, flash memory), electronic storage unit 2815 (e.g., hard disk), communication interface 2820 (e.g., network adapter) for communicating with one or more other systems, and peripheral devices 2825, such as cache, other memory, data storage and/or electronic display adapters. The memory 2810, storage unit 2815, interface 2820 and peripheral devices 2825 are in communication with the CPU 2805 through a communication bus (solid lines), such as a motherboard. The storage unit 2815 can be a data storage unit (or data repository) for storing data. The computer system 2801 can be operatively coupled to a computer network (“network”) 2830 with the aid of the communication interface 2820. The network 2830 can be the Internet, an internet and/or extranet, or an intranet and/or extranet that is in communication with the Internet. The network 2830 in some cases is a telecommunication and/or data network. The network 2830 can include one or more computer servers, which can enable distributed computing, such as cloud computing. The network 2830, in some cases with the aid of the computer system 2801, can implement a peer-to-peer network, which may enable devices coupled to the computer system 2801 to behave as a client or a server.

The CPU 2805 can execute a sequence of machine-readable instructions, which can be embodied in a program or software. The instructions may be stored in a memory location, such as the memory 2810. The instructions can be directed to the CPU 2805, which can subsequently program or otherwise configure the CPU 2805 to implement methods of the present disclosure. Examples of operations performed by the CPU 2805 can include fetch, decode, execute, and writeback.

The CPU 2805 can be part of a circuit, such as an integrated circuit. One or more other components of the system 2801 can be included in the circuit. In some cases, the circuit is an application specific integrated circuit (ASIC).

The storage unit 2815 can store files, such as drivers, libraries and saved programs. The storage unit 2815 can store user data, e.g., user preferences and user programs. The computer system 2801 in some cases can include one or more additional data storage units that are external to the computer system 2801, such as located on a remote server that is in communication with the computer system 2801 through an intranet or the Internet.

The computer system 2801 can communicate with one or more remote computer systems through the network 2830. For instance, the computer system 2801 can communicate with a remote computer system of a user. Examples of remote computer systems include personal computers (e.g., portable PC), slate or tablet PC's (e.g., Apple® iPad, Samsung® Galaxy Tab), telephones, Smart phones (e.g., Apple® iPhone, Android-enabled device, Blackberry®), or personal digital assistants. The user can access the computer system 2801 via the network 2830.

Methods as described herein can be implemented by way of machine (e.g., computer processor) executable code stored on an electronic storage location of the computer system 2801, such as, for example, on the memory 2810 or electronic storage unit 2815. The machine executable or machine readable code can be provided in the form of software. During use, the code can be executed by the processor 2805. In some cases, the code can be retrieved from the storage unit 2815 and stored on the memory 2810 for ready access by the processor 2805. In some situations, the electronic storage unit 2815 can be precluded, and machine-executable instructions are stored on memory 2810.

The code can be pre-compiled and configured for use with a machine having a processor adapted to execute the code, or can be compiled during runtime. The code can be supplied in a programming language that can be selected to enable the code to execute in a pre-compiled or as-compiled fashion.

Aspects of the systems and methods provided herein, such as the computer system 2801, can be embodied in programming. Various aspects of the technology may be thought of as “products” or “articles of manufacture” typically in the form of machine (or processor) executable code and/or associated data that is carried on or embodied in a type of machine readable medium. Machine-executable code can be stored on an electronic storage unit, such as memory (e.g., read-only memory, random-access memory, flash memory) or a hard disk. “Storage” type media can include any or all of the tangible memory of the computers, processors or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives and the like, which may provide non-transitory storage at any time for the software programming. All or portions of the software may at times be communicated through the Internet or various other telecommunication networks. Such communications, for example, may enable loading of the software from one computer or processor into another, for example, from a management server or host computer into the computer platform of an application server. Thus, another type of media that may bear the software elements includes optical, electrical and electromagnetic waves, such as used across physical interfaces between local devices, through wired and optical landline networks and over various air-links. The physical elements that carry such waves, such as wired or wireless links, optical links or the like, also may be considered as media bearing the software. As used herein, unless restricted to non-transitory, tangible “storage” media, terms such as computer or machine “readable medium” refer to any medium that participates in providing instructions to a processor for execution.

Hence, a machine readable medium, such as computer-executable code, may take many forms, including but not limited to, a tangible storage medium, a carrier wave medium or physical transmission medium. Non-volatile storage media include, for example, optical or magnetic disks, such as any of the storage devices in any computer(s) or the like, such as may be used to implement the databases, etc. shown in the drawings. Volatile storage media include dynamic memory, such as main memory of such a computer platform. Tangible transmission media include coaxial cables; copper wire and fiber optics, including the wires that comprise a bus within a computer system. Carrier-wave transmission media may take the form of electric or electromagnetic signals, or acoustic or light waves such as those generated during radio frequency (RF) and infrared (IR) data communications. Common forms of computer-readable media therefore include for example: a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD or DVD-ROM, any other optical medium, punch cards, paper tape, any other physical storage medium with patterns of holes, a RAM, a ROM, a PROM and EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave transporting data or instructions, cables or links transporting such a carrier wave, or any other medium from which a computer may read programming code and/or data. Many of these forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to a processor for execution.

The computer system 2801 can include or be in communication with an electronic display 2835 that comprises a user interface (UI) 2840 for providing, for example, the detectable signal sequences mentioned herein or the identification of analytes as mentioned herein or the location of analytes as disclosed herein or any other information disclosed herein. Examples of UI's include, without limitation, a graphical user interface (GUI) and web-based user interface.

Methods and systems of the present disclosure can be implemented by way of one or more algorithms. An algorithm can be implemented by way of software upon execution by the central processing unit 2805. The algorithm can, for example, direct the optical modules disclosed herein to capture an image or direct probe binding.

Equivalents and Scope

Those skilled in the art will recognize, or be able to ascertain using no more than routine experimentation, many equivalents to the specific embodiments in accordance with the present disclosure described herein. The scope of the present disclosure is not intended to be limited to the above Description, but rather is as set forth in the appended claims.

In the claims, articles such as “a,” “an,” and “the” may mean one or more than one unless indicated to the contrary or otherwise evident from the context. Claims or descriptions that include “or” between one or more members of a group are considered satisfied if one, more than one, or all of the group members are present in, employed in, or otherwise relevant to a given product or process unless indicated to the contrary or otherwise evident from the context. The present disclosure includes embodiments in which exactly one member of the group is present in, employed in, or otherwise relevant to a given product or process. The present disclosure includes embodiments in which more than one, or all of the group members are present in, employed in, or otherwise relevant to a given product or process.

Where ranges are given, endpoints are included. Furthermore, it is to be understood that unless otherwise indicated or otherwise evident from the context and understanding of one of ordinary skill in the art, values that are expressed as ranges can assume any specific value or subrange within the stated ranges in different embodiments of the present disclosure, to the tenth of the unit of the lower limit of the range, unless the context clearly dictates otherwise.

All cited sources, for example, references, publications, databases, database entries, and art cited herein, are incorporated into this application by reference, even if not expressly stated in the citation. In case of conflicting statements of a cited source and the instant application, the statement in the instant application shall control.

Section and table headings are not intended to be limiting.

EXAMPLES

Below are examples of specific embodiments for carrying out the present invention. The examples are offered for illustrative purposes only, and are not intended to limit the scope of the present invention in any way. Efforts have been made to ensure accuracy with respect to numbers used (e.g., amounts, temperatures, etc.), but some experimental error and deviation should, of course, be allowed for.

The practice of the present disclosure may employ, unless otherwise indicated, conventional methods of protein chemistry, biochemistry, recombinant DNA techniques and pharmacology, within the skill of the art. Such techniques are explained fully in the literature. See, e.g., T. E. Creighton, Proteins: Structures and Molecular Properties (W.H. Freeman and Company, 1993); A. L. Lehninger, Biochemistry (Worth Publishers, Inc., current edition); Sambrook, et al., Molecular Cloning: A Laboratory Manual (2nd Edition, 1989); Methods In Enzymology (S. Colowick and N. Kaplan eds., Academic Press, Inc.); Remington's Pharmaceutical Sciences, 18th Edition (Easton, Pa.: Mack Publishing Company, 1990); Carey and Sundberg, Advanced Organic Chemistry 3rd Ed. (Plenum Press) Vols A and B (1992).

Example 1: Dense Packing of Molecules

Methods below describe how to utilize a square ordered array where the pitch ranges between 200 nm and 333 nm. Additional methods are described that allow even smaller pitches. An imaging system described in International Application PCT/US2018/020737, filed Mar. 2, 2018 and incorporated herein by reference, is used as a reference system which enables sub-diffraction limit imaging. The optical system can include multiple 2,048 by 2,048 pixel cameras operating at up to 100 frames per second (fps) with a field size of 332.8 um by 332.8 um. This system is capable of measuring as little as a single fluorophore at and above 90 fps. Using this system with 1-10 copies (or 1-10 fluorophores) per molecule at 85 fps achieves the necessary throughput to image a 63 mm×63 mm slide in under 15 minutes. Biochemistry cycles and imaging are continuously and simultaneously performed, either by using two chips or by dividing a single chip into at least 2 regions.
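The throughput figure above can be checked with a rough estimate, sketched below. The estimate assumes non-overlapping square fields tiled across the slide, one frame per field, all cameras (for example, one per color) acquiring in parallel, and no allowance for stage motion or settling; none of these assumptions is stated in the disclosure, and the variable names are illustrative.

```python
import math

# Values quoted in the text.
slide_side_mm = 63.0
field_side_um = 332.8
camera_pixels_per_side = 2048
frame_rate_fps = 85.0

# Assumption: non-overlapping square fields tiled across the slide.
fields_per_side = math.ceil(slide_side_mm * 1000.0 / field_side_um)
total_fields = fields_per_side ** 2

# Assumption: one frame per field, cameras running in parallel,
# no time lost to stage motion or settling.
scan_minutes = total_fields / frame_rate_fps / 60.0

# Pixel size at the sample implied by the field size and camera format.
pixel_nm = field_side_um * 1000.0 / camera_pixels_per_side

print(f"fields: {total_fields}, scan time: {scan_minutes:.1f} min, pixel: {pixel_nm:.1f} nm")
```

Under these assumptions a single pass over the slide takes roughly seven minutes, which leaves margin within the stated 15 minutes for overhead that the sketch ignores.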

Example 2: Single-Molecule Sequencing Using Sequencing by Synthesis

Single-molecule sequencing using a sequencing-by-synthesis approach was evaluated on the Apton System. To test the methodology, single-stranded DNA templates with a 5′ phosphate group were first attached to the chip with a carbohydrazide activated silicon surface of the flow cell through EDC (1-Ethyl-3-(3-dimethylaminopropyl)carbodiimide) chemistry. The sequencing primer was then annealed to the target deposited on the surface. The sequencing templates used in our initial studies included synthetic oligonucleotides containing EGFR L858R, EGFR T790M, and BRAF V600E mutations and two cDNA samples reverse transcribed from ERCC 00013 and ERCC 00171 control RNA transcripts. After DNA template immobilization and primer annealing, the flow cell was loaded on the Apton instrument for sequencing reactions, which involve multiple cycles of enzymatic single nucleotide incorporation, imaging to detect the fluorescent dye, and chemical cleavage. Therminator IX DNA Polymerase from NEB, a 9°N™ DNA Polymerase variant with an enhanced ability to incorporate modified dideoxynucleotides, was used for the single base extension reaction. The four dNTPs used in the reaction are labeled with 4 different cleavable fluorescent dyes and blocked at the 3′-OH group with a cleavable moiety (dCTP-AF488, dATP-AFCy3, dTTP-TexRed, and dGTP-Cy5 from MyChem). During each sequencing reaction cycle, a single labeled dNTP is incorporated and the reaction is terminated because of the 3′-blocking group on the dNTP. After dNTP incorporation, the unincorporated nucleotides are removed from the flow cell by washing and the incorporated fluorescent dye labeled nucleotide is imaged to identify the base. After the images are captured, the fluorescent dye and blocking moiety are cleaved from the incorporated nucleotide using 100 mM TCEP (tris(2-carboxyethyl)phosphine), pH 9.0, allowing subsequent addition of the next complementary nucleotide in the next cycle. This extension, detection and cleavage cycle is then repeated to increase the read length.
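The extension, detection and cleavage cycle described above can be illustrated with a toy model. The sketch below assumes error-free incorporation of exactly one blocked nucleotide per cycle; the dye assignments are those listed in this example, while the two template sequences are hypothetical and chosen only so that the first incorporated base differs (C versus T).

```python
# Dye assignments as listed in Example 2.
DYE_FOR_BASE = {"C": "AF488", "A": "AFCy3", "T": "TexRed", "G": "Cy5"}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def sequence_by_synthesis(template, n_cycles):
    """Return the (base, dye) pair observed in each cycle.

    `template` is read in the direction the polymerase travels, starting at
    the first template base after the primer. Assumption: every cycle
    incorporates exactly one 3'-blocked, dye-labeled nucleotide.
    """
    calls = []
    for template_base in template[:n_cycles]:
        incorporated = COMPLEMENT[template_base]  # extension, then wash
        dye = DYE_FOR_BASE[incorporated]          # imaging identifies the base
        calls.append((incorporated, dye))         # cleavage unblocks the 3' end
    return calls


# Hypothetical templates that differ only at the first base after the primer.
wild_type = "GTACGGTCAA"
mutant = "ATACGGTCAA"

wt_calls = sequence_by_synthesis(wild_type, 10)
mut_calls = sequence_by_synthesis(mutant, 10)
print(wt_calls[0], mut_calls[0])
```

In this toy model the two templates are distinguished in the first cycle by dye color alone and give identical calls in the remaining nine cycles.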

FIG. 15A shows results of sequencing of a 1:1 mixture of synthetic oligonucleotide templates corresponding to the region around codon 790 in the EGFR gene containing equal amounts of mutant and wild type (WT) targets. The images are from incorporation of dye labeled nucleotides used to sequence synthetic templates corresponding to a region of the EGFR gene near codon 790 with a mutation at the first base after the primer (C-incorporation in WT and T-incorporation in mutant). The montage in FIG. 15A depicts images from alternating base incorporation and cleavage cycles. These data show the ability of the system to detect 10 cycles of base incorporation. Arrows indicate the base change observed.

The synthetic oligonucleotides used were around 60 nucleotides long. A primer that had a sequence ending one base prior to the mutation in codon 790 was used to enable the extension reaction. The surface was imaged post incorporation of nucleotides by the DNA polymerase and after the cleavage reaction with TCEP. The yellow circle indicates the location of the template molecule that was aligned using data from 10 consecutive cycles of dye incorporation. Molecules were first identified using known color incorporation sequences; the actual base incorporations were then identified by visual inspection, which is labor-intensive.

Dye labeled nucleotides were used to sequence cDNA generated from RNA templates. The RNA used was generated by T7 transcription from cloned ERCC control plasmids. FIG. 15B depicts images from alternating base incorporation and cleavage cycles. The data show the ability of the system to detect 10 cycles of base incorporation. The sequences observed were correct. Yellow arrows indicate the cleavage cycles.

Specifically, cDNA templates corresponding to transcripts generated from the ERCC (External RNA Controls Consortium) control plasmids by T7 transcription were sequenced. The cDNA molecules generated were >350 nucleotides long. The surface was imaged post incorporation of nucleotides by the DNA polymerase and after the cleavage reaction with TCEP. The yellow circle in FIG. 15B indicates the location of the template molecule that was aligned using data from 10 consecutive cycles of dye incorporation. The data indicated the ability to detect 10 cycles of nucleotide incorporation by manual viewing of images.

Example 3: Relative Location Determination for Analyte Variants

FIG. 16 is an image of single molecules deposited on a substrate and bound by a probe comprising a fluorophore. The molecules are anti-ERK antibodies bound to ERK protein from cell lysate which has been covalently attached to the solid support. The antibodies are labeled with 3-5 fluorophores per molecule. Similar images are attainable with single fluor nucleic acid targets, e.g., during sequencing by synthesis.

To improve accuracy of detection, the molecules undergo successive cycles of probe binding and stripping, in this case 30 cycles. In each round, the image is processed to determine the location of the molecules. The images are background subtracted and oversampled by 2×, after which peaks are identified. Multiple layers of cycles are overlaid on a 20 nm grid. The location variance is the standard deviation, or the radius divided by the square root of the number of measurements. FIG. 17, right panel, shows each peak from each cycle overlaid. The left panel is the smoothed version of the right panel. Each bright spot represents a molecule. The molecule locations are resolvable with molecule-to-molecule distances under 200 nm. FIG. 18 shows localization variation for each of a plurality of molecules found in a field. The median localization variance is 5 nm and the 3 sigma localization variance is under 10 nm.
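The way repeated cycles tighten the position estimate can be sketched as follows. This is one reading of the statement above, not the disclosed implementation: the "radius" is taken as the root-mean-square radial scatter of the per-cycle peaks about their mean, and the location uncertainty as that radius divided by the square root of the number of cycles. The function name and the sample coordinates are hypothetical.

```python
import math
import statistics


def localize(peaks_nm):
    """Combine per-cycle peak positions (x, y), in nm, for one molecule.

    Returns (center, radius, uncertainty). Assumptions: `radius` is the RMS
    radial scatter of the peaks about their mean, and `uncertainty` is
    radius / sqrt(number of measurements).
    """
    n = len(peaks_nm)
    xs = [p[0] for p in peaks_nm]
    ys = [p[1] for p in peaks_nm]
    center = (statistics.fmean(xs), statistics.fmean(ys))
    radius = math.sqrt(statistics.pvariance(xs) + statistics.pvariance(ys))
    uncertainty = radius / math.sqrt(n)
    return center, radius, uncertainty


# Hypothetical peak positions for one molecule over four cycles.
center, radius, uncertainty = localize([(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)])
print(center, round(radius, 3), round(uncertainty, 3))
```

Because the uncertainty falls with the square root of the number of measurements, 30 cycles reduce it to roughly 18% of the single-cycle scatter under these assumptions.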

Example 4: Densely-Packed Sequencing Substrates and Single-Sided Density

Single-Stranded Circle Formation

To prepare a library of concatemers comprising target sequences to distribute on the surface of a substrate in a randomly distributed close-packed layer, a sample comprising target sequences was amplified, purified, ligated to form circularized DNA, and quantified, as shown in FIG. 23A.

Amplification of Targets

An Illumina MiSeq library was purchased from SeqMatic (Fremont, Calif.), made with the standard protocol using E. coli DNA purchased from Affymetrix (Santa Clara, Calif.; PN 14380).

The library was amplified by PCR amplification. Each PCR reaction included the components listed in Table 1:

TABLE 1
Component                                                One 50 uL reaction (uL)
10X Pfx Amplification buffer                             10
10 mM dNTP (Invitrogen)                                  1.5
50 mM MgSO4 (stored at 4° C.)                            1
Primer mix (1004)                                        1.5
Template DNA                                             1-5
Platinum Pfx DNA Polymerase (Invitrogen - ThermoFisher)  0.4
Pfx Enhancer (Invitrogen - ThermoFisher)                 2.5
Water                                                    Fill with water to 50 uL

The primer mix is a 50:50 mix of P5-Phosphate (/5Phos/AAT GAT ACG GCG ACC ACC GA) and P7 (CAA GCA GAA GAC GGC ATA CGA GAT) primers at 10 uM.

The PCR amplification was performed under the following conditions: 5 min at 94° C. followed by 35 cycles of: 94° C., 15 sec; 55° C., 30 sec; and 68° C., 30 sec. An aliquot of the amplification product was run on a 2% gel to verify the library molecule size (300-500 base pairs in this instance). The PCR amplification product was then purified using a PureLink® Spin Column (ThermoFisher) according to the manufacturer's protocol.

Circularization of Target DNA

The purified PCR amplification products were then subjected to single strand circularization by ligation in the reaction mix described in Table 2:

TABLE 2
Component                                                Single reaction (uL)
10X HiFi Taq DNA Ligase Buffer                           5
DNA template (104)                                       10
Bridging oligonucleotide (100 uM)                        1
HiFi Taq DNA Ligase (New England Biolabs, Ipswich MA)    1
H2O                                                      33
Total vol (uL)                                           50

The bridging oligonucleotide sequence was TCG GTG GTC GCC GTA TCA TTC AAG CAG AAG ACG GCA TAC GAG AT.
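The relationship between the bridging oligonucleotide and the two amplification primers can be checked directly from the sequences given in this example. The sketch below shows that the bridging oligonucleotide reads as the reverse complement of the P5 primer followed by the P7 primer; the helper name `reverse_complement` is illustrative and is not part of the disclosure.

```python
# Sequences as given in Example 4, written without spaces.
P5 = "AATGATACGGCGACCACCGA"
P7 = "CAAGCAGAAGACGGCATACGAGAT"
BRIDGE = "TCGGTGGTCGCCGTATCATTCAAGCAGAAGACGGCATACGAGAT"

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


# The bridge spans the junction: reverse complement of P5, then P7.
expected_bridge = reverse_complement(P5) + P7
print(expected_bridge == BRIDGE, len(BRIDGE))
```

A check of this kind can help catch transcription errors in oligonucleotide sequences before they are ordered.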

The ligation was performed under the following conditions: 30 sec at 95° C. followed by 40 cycles of: 95° C., 15 sec; 55° C., 2 min; and 62° C., 3 min.

After ligation, 1 μL each of Exonuclease I and Exonuclease III (New England Biolabs) was added and the reaction was incubated for an additional 45 min at 37° C. and 30 min at 85° C. The resulting material was purified using a Zymo-Spin™ Column (Oligo Clean & Concentrator™ kit, Zymo Research, Irvine, Calif.) using the manufacturer's protocol. After purification, the concentration was measured using a Qubit 2.0 fluorometer (ThermoFisher) and Quant-iT OliGreen® (ThermoFisher) with custom calibration samples using an oligonucleotide of known concentration.

Concatemer Formation from Circularized DNA

Concatemers from circularized DNA comprising the target sequence were formed in a reaction mix described in Table 3:

TABLE 3
Component                      Volume   Buffer               Additional components
Circular template              10 μL    water
Primer solution                5 μL     3X reaction buffer
Enzyme mix                     5 μL     1X reaction buffer   2 U/uL Phi29 DNA polymerase, 2 mM in each dNTP, 0.004 U/uL iPPase (all from New England Biolabs, Ipswich, MA)
Reaction inactivation buffer   5 μL                          0.25 M EDTA, pH 8.0 (Sigma-Aldrich, St. Louis, MO)

The primer solution was a 750 nM suspension of the primer (ATC TCG TAT GCC GTC TTC TGC TTG) in 3× reaction buffer. The 10× reaction buffer was: 500 mM Tris-HCl, 100 mM (NH4)2SO4, 40 mM DTT, 100 mM MgCl2, pH 7.5 @ 25° C.

The circular template+primer mix was incubated for 10 min at 90° C., and then 30 min at 30° C. A pre-warmed enzyme mix was then added as in Table 3 for 90 min. The reaction was stopped with the addition of reaction inactivation buffer and stored at 4° C.

Concatemer libraries were then layered on a substrate to form a densely-packed, randomly distributed layer bound to the surface of a substrate, followed by sequencing the bound concatemers via imaging and image processing, and analysis of the data, as shown in FIG. 23B and as described below.

One microliter of the sequencing substrate was mixed with 19 ul of citrate phosphate buffer, and 10 ul was loaded onto a custom biochip and incubated overnight. The chip was then washed 2× with citrate phosphate buffer, 2× with potassium phosphate buffer and 2× with NA wash 3 buffer.

Fluorescent probe was bound to the concatemer layer bound to the surface of the chip to determine identity. Images showing the density are shown in FIGS. 25A-25C. FIG. 25D shows a plot of measured density of a 1-sided concatemer layer according to methods described herein (Apton - control target) and simulated distributions at higher densities (Apton - Sim).

Example 5: Sequencing E. Coli Reads

Imaging/Sequencing

Sequencing by synthesis was performed using standard sequencing chemistries. The chip comprising the densely packed concatemer layer was loaded into the AptonBio Sequencer and washed 6×5 min at 60° C. with Wash1 (20 mM Tris-HCl, 10 mM (NH4)2SO4, 10 mM KCl, 2 mM MgSO4, 0.1% Triton X-100, pH 8.8 @ 25° C., 50 mM NaCl). The sequencing oligo (ATC TCG TAT GCC GTC TTC TGC TTG) was diluted to 100 nM in hybridization buffer and incubated 1×1 min followed by 2×10 min at 60° C., with Wash1 washes between hybridization operations. Then thirty-two cycles of the following 8 operations were performed:

1 - Cleavage: 225 sec at 60° C. with buffer in Table 4

TABLE 4
Component            Amount     Concentration (Working)
TCEP [add vendor]    31.53 mg   100 mM
10N NaOH             40 uL
5M NaCl              11 uL      50 mM
1M Tris-HCl          11         −1M
Total Volume         1100

2 - Wash: 240 sec at 30° C. in Phosphate buffer pH 8.

3 - Imaging: Wash2 (20 mM Tris-HCl, 5 mM Ascorbic Acid, pH 8.8)

4 - Wash: Wash1 at 60° C.

5 - Extension: 450 sec at 60° C. with buffer in Table 5

TABLE 5
Component                                                     Concentration (Stock)   Concentration (Working)   Vol/reaction (uL)
ThermoPol Reaction Buffer (NEB)                               10x                     1x                        5
dATP labeled reversible terminator (MyChem, LLC, San Diego)   5 uM                    0.1 uM                    1
dGTP labeled reversible terminator (MyChem, LLC, San Diego)   5 uM                    0.1 uM                    1
dTTP labeled reversible terminator (MyChem, LLC, San Diego)   5 uM                    0.1 uM                    1
dCTP labeled reversible terminator (MyChem, LLC, San Diego)   5 uM                    0.1 uM                    1
NaCl                                                          5M                      0.05M                     0.5
Therminator X (New England Biolabs, Ipswich, MA)              10 U/uL                 0.05 U                    1.25
Non-labeled dNTP Mix (MyChem, LLC, San Diego)                 10 uM                   0 to 1 uM                 0-5
Water                                                                                                           38.75-34.24
Total                                                                                                           50

6 - Wash: Wash1 at 30° C.

7 - Wash: 2 min at 30° C. in Phosphate buffer pH 8.

8 - Imaging: Wash2.
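The working concentrations listed for the cleavage and extension buffers can be cross-checked with simple dilution arithmetic, as sketched below. The sketch assumes that the unlabeled volumes in Table 4 are in microliters (so that the total is 1100 uL) and that the TCEP is weighed as the hydrochloride salt with a molar mass of about 286.65 g/mol; neither assumption is stated in the tables, and the function name is illustrative.

```python
def working_concentration(stock_conc, stock_volume_ul, total_volume_ul):
    """Dilution: C_working = C_stock * V_stock / V_total (same units as stock)."""
    return stock_conc * stock_volume_ul / total_volume_ul


# Table 4 (cleavage buffer). Assumption: total volume is 1100 uL.
nacl_mM = working_concentration(5000.0, 11.0, 1100.0)  # 5 M stock, 11 uL
# Assumption: the Tris-HCl row is 11 uL of the 1 M stock.
tris_mM = working_concentration(1000.0, 11.0, 1100.0)

# Assumption: TCEP weighed as the hydrochloride salt (about 286.65 g/mol).
TCEP_HCL_G_PER_MOL = 286.65
tcep_mM = 31.53 / TCEP_HCL_G_PER_MOL / 1.1 * 1000.0    # mg / (g/mol) / mL -> mM

# Table 5 (extension buffer), 50 uL reaction.
terminator_uM = working_concentration(5.0, 1.0, 50.0)   # each labeled terminator
nacl_ext_M = working_concentration(5.0, 0.5, 50.0)

print(nacl_mM, tris_mM, round(tcep_mM, 1), terminator_uM, nacl_ext_M)
```

Under these assumptions the computed values agree with the 50 mM NaCl and 100 mM TCEP entries in Table 4 and with the 0.1 uM terminator and 0.05 M NaCl entries in Table 5.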

Results:

Reads of 30-40 bp are shown in FIG. 27A. Reads of 20-25 bp are shown in FIG. 27B.

Crossplots shown in FIG. 27C show the resolution of base calling at individual spots for E. coli sequencing.

While preferred embodiments of the present invention have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. It is not intended that the invention be limited by the specific examples provided within the specification. While the invention has been described with reference to the aforementioned specification, the descriptions and illustrations of the embodiments herein are not meant to be construed in a limiting sense. Numerous variations, changes, and substitutions will now occur to those skilled in the art without departing from the invention. Furthermore, it shall be understood that all aspects of the invention are not limited to the specific depictions, configurations or relative proportions set forth herein which depend upon a variety of conditions and variables. It should be understood that various alternatives to the embodiments of the invention described herein may be employed in practicing the invention. It is therefore contemplated that the invention shall also cover any such alternatives, modifications, variations or equivalents. It is intended that the following claims define the scope of the invention and that methods and structures within the scope of these claims and their equivalents be covered thereby.

1.-19. (canceled)
20. A method for accurately determining a relative position of analytes deposited on a surface of a densely packed substrate, comprising:
(a) providing a substrate comprising a surface, wherein the surface is patterned or unpatterned and comprises a plurality of analytes deposited on the surface at discrete locations;
(b) performing a plurality of cycles of probe binding and signal detection on said surface, each cycle comprising:
(i) contacting said analytes with a plurality of probes from a probe set, wherein said probes comprise a detectable label, wherein each of said probes binds specifically to a target analyte; and
(ii) imaging a field of said surface with an optical system to detect a plurality of optical signals from individual probes bound to said analytes at discrete locations on said surface;
(c) determining a peak location from each of said plurality of optical signals from images of said field from at least two of said plurality of cycles; and
(d) overlaying said peak locations for each optical signal and applying an optical distribution model at each cluster of optical signals to determine a relative position of each detected analyte on said surface with improved accuracy.
21. The method of claim 1, further comprising: (e) resolving said optical signals in each field image from each cycle using said determined relative position and a resolving function; and (f) identifying said detectable labels bound to said deposited analytes for each field and each cycle from said deconvolved optical signals.
22. The method of claim 1, wherein one or more analytes of said plurality of analytes are treated with a repellant or attractive substance.
23. The method of claim (f), wherein said repellant or attractive substance comprises zwitterionic features.
24. The method of claim (f), wherein said repellant or attractive substance comprises PEG, a polysaccharide, ampholine ampholytes, sulphobetaine, and/or BSA.
25. The method of claim 1, wherein said analytes are DNA concatemers.
26. The method of claim 24, wherein said DNA concatemers are hybridized to ssDNA hairs.
27. The method of claim 1, wherein said analytes are proteins or peptides.
28. The method of claim 21, further comprising using said detectable label identity for each analyte detected at each cycle to identify a plurality of said analytes on said substrate.
29. The method of claim 21, wherein said resolving comprises removing interfering optical signals from neighboring analytes using a center-to-center distance between said neighboring analytes from said determined relative positions of said neighboring analytes.
30. The method of claim 21, wherein said resolving function comprises deconvolution.
31. The method of claim 1, wherein said analytes are single biomolecules.
32. The method of claim 1, wherein said analytes deposited on said surface are spaced apart on average less than the diffraction limit of the light emitted by the detectable labels and imaged by the optical system.
33. The method of claim 1, wherein the deposited analytes comprise an average center-to-center distance between each analyte and the nearest adjacent analyte of less than 500 nm.
34. The method of claim 1, wherein said overlaying said peak locations comprises aligning positions of said optical signal peaks detected in each field for a plurality of said cycles to generate a cluster of optical peak positions for each analyte from said plurality of cycles.
35. The method of claim 1, wherein said relative position is determined with an accuracy of within 10 nanometers RMS.
36. The method of claim 1, wherein said method resolves optical signals from a surface at a density of about 4 to about 25 analytes per square micron.
37. A system for determining the identity of a plurality of analytes, comprising:
(a) an optical imaging device configured to image a plurality of optical signals from a field of a substrate over a plurality of cycles of probe binding to analytes deposited on a surface of the substrate, wherein said surface is unpatterned; and
(b) an image processing module, said module configured to:
(i) determine a peak location from each of said plurality of optical signals from images of said field from at least two of said plurality of cycles;
(ii) determine a relative position of each detected analyte on said surface with improved accuracy by applying an optical distribution model to each cluster of optical signals from said plurality of cycles; and
(iii) deconvolve said optical signals in each field image from each cycle using said determined relative position and a resolving function.
38. The system of claim 36, wherein said image processing module is further configured to determine an identity of said analytes deposited on said surface using said deconvolved optical signals.
39. The system of claim 36, wherein said optical image device comprises a moveable stage defining a scannable area.
40. The system of claim 36, wherein said optical image device comprises a sensor and optical magnification configured to sample a surface of a substrate at below the diffraction limit in said scannable area.
41. The system of claim 36, further comprising a substrate comprising analytes deposited to an unpatterned surface of the substrate at a center-to-center spacing below the diffraction limit.
42. The system of claim 36, wherein said resolving comprises removing interfering optical signals from neighboring analytes using a center-to-center distance between said neighboring analytes to determine said relative positions of said neighboring analytes.
43.-79. (canceled)
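
By way of illustration only, the overlaying of peak locations across cycles and the estimation of a relative position for each cluster, as recited in the method above, might be sketched as follows. This is a minimal example under stated assumptions and is not the disclosed implementation: the optical distribution model is assumed here to be an isotropic Gaussian scatter of peak locations around the true analyte position (so the cluster centroid is the maximum-likelihood position estimate), peaks are assumed to be already registered to a common field coordinate frame, and the function name `overlay_and_localize` and the grouping radius `radius_nm` are hypothetical.

```python
import math


def overlay_and_localize(peaks_by_cycle, radius_nm=100.0):
    """Overlay per-cycle peak locations and estimate one position per cluster.

    peaks_by_cycle: list of cycles, each a list of (x_nm, y_nm) peak locations,
        assumed registered to a common coordinate frame.
    radius_nm: hypothetical grouping radius; a peak closer than this to a
        running cluster centroid is assigned to that cluster.
    """
    clusters = []  # each cluster is a list of (x, y) peaks
    centers = []   # running centroid of each cluster

    for cycle in peaks_by_cycle:
        for x, y in cycle:
            # Find the nearest existing cluster centroid within radius_nm.
            best, best_d = None, radius_nm
            for i, (cx, cy) in enumerate(centers):
                d = math.hypot(x - cx, y - cy)
                if d < best_d:
                    best, best_d = i, d
            if best is None:
                # No cluster nearby: start a new one at this peak.
                clusters.append([(x, y)])
                centers.append((x, y))
            else:
                pts = clusters[best]
                pts.append((x, y))
                n = len(pts)
                centers[best] = (
                    sum(p[0] for p in pts) / n,
                    sum(p[1] for p in pts) / n,
                )

    results = []
    for pts, (cx, cy) in zip(clusters, centers):
        n = len(pts)
        if n < 2:
            continue  # skip clusters containing only a single peak
        # Assumed Gaussian model: centroid is the position estimate, and the
        # RMS radial scatter summarizes the spread of peaks around it.
        mean_sq = sum((px - cx) ** 2 + (py - cy) ** 2 for px, py in pts) / n
        results.append(
            {"x_nm": cx, "y_nm": cy, "n_peaks": n, "rms_scatter_nm": math.sqrt(mean_sq)}
        )
    return results


# Example with made-up peak locations (nm) from three cycles.
cycles = [
    [(100.0, 100.0), (400.0, 100.0)],
    [(104.0, 98.0), (396.0, 103.0)],
    [(98.0, 102.0)],
]
positions = overlay_and_localize(cycles)
for p in positions:
    print(p)
```

The greedy nearest-centroid grouping is chosen for brevity; for densely packed fields a more robust clustering or a fitted point-spread model could be substituted without changing the overall flow of overlay followed by per-cluster position estimation.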